State of the Field
Recent research on large language model (LLM) evaluation increasingly focuses on understanding and mitigating the limitations of these systems in real-world applications. New benchmarks such as CryptoAnalystBench and TRACK assess LLM performance in complex scenarios and reveal persistent failure modes that traditional metrics may overlook. The ViHallu Challenge highlights the need for robust hallucination detection, particularly in low-resource languages, while frameworks like KNIGHT streamline the generation of assessment datasets and so reduce the high cost of evaluation. Methodologies such as ErrorMap and the Task-Specificity Score offer deeper insight into why models fail, helping developers refine their systems more effectively. Together, these more nuanced evaluation approaches aim to make LLMs more reliable and to ease the commercial challenge of deploying them in diverse, high-stakes environments.
Papers
LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations
Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce...
CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis
Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks...
DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that c...
Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are...
ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rath...
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time...
NanoKnow: How to Know What Your Language Model Knows
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of n...
SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting lett...
Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge
A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce...
Even GPT-5.2 Can't Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs
We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors. While ZEH itself is simple, we demonstrate that evaluating the Z...