Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini • October 20, 2025
This paper introduces ImpossibleBench, a benchmark framework to quantify an LLM's propensity to exploit test cases. We create "impossible" variants of coding tasks by mutating test cases to conflict with natural-language specifications, measuring an agent's "cheating rate" as its pass rate on these impossible tasks.
Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng • October 12, 2025
This survey provides a comprehensive review of "Vibe Coding," a paradigm where developers validate AI-generated code through outcome observation rather than line-by-line comprehension. We analyze over 1,000 research papers, examining LLMs for coding, coding agents, development environments, and feedback mechanisms.
Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, Zhangyang Wang • October 13, 2025
We investigate the "LLM Brain Rot Hypothesis"—that continual exposure to low-quality web text can induce cognitive decline in LLMs. Through controlled experiments on Twitter/X corpora, we demonstrate significant declines in reasoning, long-context understanding, and safety, while inflating negative traits.
Tarun Gupta, Danish Pruthi • February 16, 2025
We engage 13 experts to evaluate 50 AI-generated research documents for plagiarism. We find that 24% are either paraphrased or significantly borrowed from existing work without proper acknowledgment, highlighting the inadequacy of automated detectors and the need for careful assessment.
Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu • August 5, 2025
This paper investigates whether Large Language Models (LLMs) can generate high-quality explanations of well-being concepts that are tailored to diverse audiences. The research constructs a large-scale dataset of 43,880 explanations from 10 diverse LLMs for 2,194 well-being concepts, and introduces a principle-guided LLM-as-a-judge evaluation framework.
Research Team • May 27, 2024
This paper provides a comprehensive analysis of tokenization in large language models, exploring the fundamental mechanisms that enable LLMs to process and understand text. The research examines various tokenization strategies, their impact on model performance, and the implications for natural language processing tasks.