Beemo: Benchmark of Expert-edited Machine-generated Outputs
Save the Date!
June 24th, 2025, 19:00–20:00 – Munich🥨NLP Discord Server.
About this Event
The event will cover the paper “Beemo: Benchmark of Expert-edited Machine-generated Outputs.”
Abstract:
The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include single-author texts (human-written and machine-generated). This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse MGT detection evaluation across various edit types. We document Beemo’s creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.
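Since Beemo and its materials are publicly released, one quick way to get a feel for the corpus before the talk is to load it with the Hugging Face datasets library. The snippet below is a minimal sketch, assuming the benchmark is hosted on the Hub under the id "toloka/beemo" with a "train" split; both the id and the split name are assumptions to verify against the paper’s release materials.

```python
# Minimal sketch: inspect the Beemo benchmark with Hugging Face `datasets`.
# The dataset id "toloka/beemo" and the "train" split are assumptions;
# check the paper's released materials for the official location.
from datasets import load_dataset

ds = load_dataset("toloka/beemo", split="train")

# Print the schema and a truncated preview of the first record. Per the
# abstract above, records are expected to pair human-written, machine-
# generated, and expert- or LLM-edited versions of each text.
print(ds.column_names)
first = ds[0]
for key, value in first.items():
    print(f"{key}: {str(value)[:80]}")
```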
Speakers
Jason S. Lucas is a Ph.D. candidate in Informatics (Data Science and AI) at The Pennsylvania State University, specializing in multilingual, multi-modal natural language processing and AI safety research. As an NSF LinDiV Fellow and Graduate Research Assistant at Penn State’s PIKE Research Lab, he leads research on combating disinformation, deepfake detection, and cross-lingual AI system vulnerabilities, with a particular focus on low-resource languages and other long-tail distribution gaps. His work has been published at top-tier conferences including EMNLP, ACL, NAACL, and COLING, garnering over 150 citations. His ongoing research explores the dual role of large AI models in both attacking and defending against harmful content across languages and modalities.