LLM Evaluation

Trending

31 papers

4.5 viability

+58% (30d)

State of the Field

Recent research on large language model (LLM) evaluation increasingly focuses on understanding and mitigating the limitations of these systems in real-world applications. New benchmarks such as CryptoAnalystBench and TRACK assess LLM performance in complex scenarios and reveal persistent failure modes that traditional metrics may overlook. The ViHallu Challenge highlights the need for robust hallucination detection, particularly in low-resource languages, while frameworks such as KNIGHT streamline the generation of assessment datasets and so reduce the high cost of evaluation. Methodologies such as ErrorMap and the Task-Specificity Score go further by explaining why models fail, which lets developers refine their systems more effectively. Together, these more nuanced evaluation approaches aim to make LLMs more reliable and to ease the commercial challenges of deploying them in diverse, high-stakes environments.

Last updated Mar 1, 2026

Papers

1–10 of 31

Research Paper·Mar 2, 2026

LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations

Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce...

7.0 viability

Research Paper·Feb 11, 2026

CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis

Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks...

7.0 viability

Research Paper·Jan 8, 2026

DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations -- fluent, plausible-sounding outputs that c...

6.0 viability

Research Paper·Jan 20, 2026

Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are...

6.0 viability

Research Paper·Jan 22, 2026

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Large language model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rath...

6.0 viability

Research Paper·Feb 23, 2026

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time...

6.0 viability

Research Paper·Feb 23, 2026

NanoKnow: How to Know What Your Language Model Knows

How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of n...

6.0 viability

Research Paper·Jan 14, 2026

SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding

Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting lett...

5.0 viability

Research Paper·Jan 21, 2026

Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge

A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce...

5.0 viability

Research Paper·Jan 22, 2026

Even GPT-5.2 Can't Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs

We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors. While ZEH itself is simple, we demonstrate that evaluating the Z...

5.0 viability
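
The Zero-Error Horizon teaser above defines the metric only as "the maximum range that a model can solve without any errors". A minimal sketch of one possible reading follows; it is not the paper's implementation. It assumes problems are indexed by an integer size and that the horizon is the largest size reached before the first error. The names `zero_error_horizon`, `toy_model_count`, and `toy_solves_all_at` are hypothetical, and the toy "model" is a stand-in rather than a real LLM call.

```python
from typing import Callable, Iterable


def zero_error_horizon(
    solves_all_at: Callable[[int], bool],
    sizes: Iterable[int],
) -> int:
    """Return the largest size n such that every size up to n had no errors.

    Assumes `sizes` is increasing and positive; returns 0 if the first
    size already contains an error.
    """
    horizon = 0
    for n in sizes:
        if not solves_all_at(n):
            break  # the first error ends the horizon, even if later sizes pass
        horizon = n
    return horizon


# Hypothetical stand-in for a model: counts correctly up to 4, then is off by one.
def toy_model_count(n: int) -> int:
    return n if n < 5 else n - 1


def toy_solves_all_at(n: int) -> bool:
    # One instance per size here; a real harness would check every instance of size n.
    return toy_model_count(n) == n


print(zero_error_horizon(toy_solves_all_at, range(1, 11)))  # prints 4
```

Under this reading, a single early failure caps the score no matter how well the model does on larger sizes. That makes it a stricter measure than average accuracy over the same range.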

