Tag: evaluation

Research Papers

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini • October 20, 2025

This paper introduces ImpossibleBench, a benchmark framework to quantify an LLM's propensity to exploit test cases. We create "impossible" variants of coding tasks by mutating test cases to conflict with natural-language specifications, measuring an agent's "cheating rate" as its pass rate on these impossible tasks.

Are Today's LLMs Ready to Explain Well-Being Concepts?

Bohan Jiang, Dawei Li, Zhen Tan, Chengshuai Zhao, Huan Liu • August 5, 2025

This paper investigates whether Large Language Models (LLMs) can generate high-quality explanations of well-being concepts that are tailored to diverse audiences. The research constructs a large-scale dataset of 43,880 explanations from 10 diverse LLMs for 2,194 well-being concepts, and introduces a principle-guided LLM-as-a-judge evaluation framework.