Recent research in large language model (LLM) evaluation increasingly focuses on understanding and mitigating the limitations of these systems in real-world applications. New benchmarks such as CryptoAnalystBench and TRACK assess LLM performance in complex scenarios, revealing persistent failure modes that traditional metrics can overlook. The ViHallu Challenge highlights the need for robust hallucination detection, particularly in low-resource languages, while frameworks like KNIGHT streamline the generation of assessment datasets, addressing the high cost of evaluation. Methodologies such as ErrorMap and the Task-Specificity Score probe why models fail, enabling developers to refine their systems more effectively. This shift toward more nuanced evaluation not only improves the reliability of LLMs but also tackles the commercial challenges of deploying these models across diverse, high-stakes environments, paving the way for more trustworthy AI applications.
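To make the contrast between aggregate metrics and failure-mode analysis concrete, here is a minimal, hypothetical sketch (not drawn from any of the papers below; the categories and numbers are invented for illustration) showing how a single accuracy figure can mask a systematic failure that a per-category breakdown exposes:

```python
from collections import defaultdict

# Hypothetical evaluation records: (failure-mode category, correct?).
# All categories and counts here are invented for illustration only.
results = (
    [("factual_recall", True)] * 90
    + [("factual_recall", False)] * 10
    + [("multi_step_reasoning", True)] * 30
    + [("multi_step_reasoning", False)] * 70
)

def breakdown(records):
    """Accuracy per failure-mode category, not just overall."""
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += correct  # bool counts as 0/1
    return {c: hits[c] / totals[c] for c in totals}

overall = sum(correct for _, correct in results) / len(results)
print(f"aggregate accuracy: {overall:.2f}")  # 0.60 -- looks uniformly mediocre
for category, acc in breakdown(results).items():
    print(f"{category}: {acc:.2f}")  # 0.90 vs 0.30 -- a hidden failure mode
```

This is only a toy illustration of the general idea; the methodologies in the papers below pursue this kind of per-failure-mode view in far more sophisticated ways.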
Top papers
- LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations (7.0)
- CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis (7.0)
- ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models (6.0)
- Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores (6.0)
- DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs (6.0)
- KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration (6.0)
- NanoKnow: How to Know What Your Language Model Knows (6.0)
- SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding (5.0)
- Judge Reliability Harness: Stress Testing the Reliability of LLM Judges (5.0)
- Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge (5.0)
- Even GPT-5.2 Can't Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs (5.0)
- Task-Specificity Score: Measuring How Much Instructions Really Matter for Supervision (5.0)
- SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers (5.0)
- DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science (5.0)
- From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning (5.0)
- Multilingual Large Language Models do not comprehend all natural languages to equal degrees (5.0)
- SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training (5.0)
- A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities (5.0)
- How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities (5.0)
- Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation (5.0)
- Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora (4.0)
- Quantifying construct validity in large language model evaluations (4.0)
- Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation (4.0)
- [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games (4.0)
- PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation (4.0)
- Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems (4.0)
- Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess (3.0)
- Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks (3.0)
- Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity (3.0)
- Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence (2.0)
- Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek (2.0)
- Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks (2.0)
- Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs (2.0)