State of the Field
The field of AI evaluation is currently focused on refining benchmarks to better assess model capabilities in real-world scenarios. Recent work highlights the need for more nuanced evaluation frameworks, such as those that account for implicit requirements in human communication or for the structural sensitivity of instruction-following tasks. For instance, new benchmarks like TSAQA and RIFT broaden the scope of evaluation beyond traditional tasks, addressing the complexities of time series analysis and the impact of prompt structure on model performance. Frameworks like Implicit Intelligence aim to measure how well AI agents can infer unstated user needs, while studies of LLM-as-judge setups reveal stable, systematic differences between evaluators, suggesting that judge models are not interchangeable. Together, these developments indicate a shift toward more comprehensive, context-aware evaluation methods, which could improve the reliability of AI systems in commercial applications such as automated content generation, decision support, and interactive agents.
Papers
SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation
Evaluating whether text-to-image models follow explicit spatial instructions is difficult to automate. Object detectors may miss targets or return multiple plausible detections, and simple geometric t...
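The entry above points at the core difficulty: detectors can miss objects or return several plausible boxes, so a hard geometric pass/fail is brittle. Below is a minimal sketch of what an uncertainty-aware spatial check could look like, assuming boxes in (x1, y1, x2, y2) format and a confidence-weighted "left of" relation; this is an illustration of the general idea, not the SpatialBench-UC protocol.

```python
# Illustrative only: a confidence-weighted check for "object A left of object B",
# marginalizing over multiple candidate detections instead of trusting the top-1 box.
# The relation, box format, and scoring rule are assumptions, not the benchmark's method.
from itertools import product

def center_x(box):
    x1, _, x2, _ = box
    return (x1 + x2) / 2.0

def p_left_of(dets_a, dets_b):
    """dets_*: list of (box, confidence) candidate detections for each object.

    Returns a confidence-weighted probability that A is left of B,
    rather than a hard pass/fail from single detections.
    """
    if not dets_a or not dets_b:
        return 0.0  # a missed detection contributes no evidence
    num, den = 0.0, 0.0
    for (box_a, conf_a), (box_b, conf_b) in product(dets_a, dets_b):
        w = conf_a * conf_b
        num += w * float(center_x(box_a) < center_x(box_b))
        den += w
    return num / den if den > 0 else 0.0

# Example: two candidate boxes for the cat, one for the dog.
cat = [((10, 20, 50, 80), 0.9), ((200, 30, 240, 90), 0.3)]
dog = [((120, 40, 180, 100), 0.8)]
print(f"P(cat left of dog) = {p_left_of(cat, dog):.2f}")
```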
Who can we trust? LLM-as-a-jury for Comparative Assessment
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely...
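As a rough illustration of the pairwise jury setup described above, the sketch below aggregates verdicts from several judge functions, querying each judge in both presentation orders to reduce position bias. The stand-in judges and the simple majority rule are assumptions for illustration, not necessarily the paper's aggregation method.

```python
# A minimal "LLM-as-a-jury" sketch for pairwise comparison. Each judge_fn stands in for
# a real LLM call and must return "A" or "B", meaning it prefers its first or second argument.
from collections import Counter

def jury_verdict(candidate_a, candidate_b, judge_fns):
    votes = []
    for judge in judge_fns:
        # Ask in both presentation orders; keep the vote only if the judge is order-consistent.
        first = judge(candidate_a, candidate_b)        # "A" or "B" in the original order
        swapped = judge(candidate_b, candidate_a)      # "A" here means the original B
        swapped = "A" if swapped == "B" else "B"       # map back to the original order
        if first == swapped:
            votes.append(first)
    tally = Counter(votes)
    if not tally or tally["A"] == tally["B"]:
        return "tie", tally
    return ("A" if tally["A"] > tally["B"] else "B"), tally

# Toy judges standing in for LLM calls: each simply prefers the longer answer.
judges = [lambda a, b: "A" if len(a) >= len(b) else "B"] * 3
print(jury_verdict("a detailed answer", "short", judges))
```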
Implicit Intelligence -- Evaluating Agents on What Users Don't Say
Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agenti...
TSAQA: Time Series Analysis Question And Answering Benchmark
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time ser...
Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior
LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 jud...
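One way the "consistent with themselves, not with each other" pattern could be probed is to compare each judge's repeat-run agreement against pairwise agreement between different judges on the same items. The sketch below uses hypothetical verdicts and plain percent agreement, not the paper's data or statistics.

```python
# Toy illustration: within-judge (test-retest) agreement vs between-judge agreement.
from itertools import combinations

def agreement(labels_x, labels_y):
    """Fraction of items on which two label sequences agree."""
    return sum(x == y for x, y in zip(labels_x, labels_y)) / len(labels_x)

# Hypothetical verdicts: two runs per judge over the same 6 items (1 = accept, 0 = reject).
runs = {
    "judge_a": ([1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1]),   # self-consistent
    "judge_b": ([0, 1, 1, 0, 0, 1], [0, 1, 1, 0, 1, 1]),   # mostly self-consistent
    "judge_c": ([1, 0, 0, 1, 1, 0], [1, 0, 0, 1, 1, 0]),   # self-consistent, different taste
}

intra = {j: agreement(r1, r2) for j, (r1, r2) in runs.items()}
inter = {f"{a} vs {b}": agreement(runs[a][0], runs[b][0])
         for a, b in combinations(runs, 2)}
print("within-judge agreement:", intra)
print("between-judge agreement:", inter)
```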
Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater model...
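To make the rater-effects idea concrete, here is a deliberately simplified severity correction on hypothetical ratings. Real IRT or many-facet Rasch models estimate rater, item, and response parameters jointly; this sketch only illustrates the direction of the adjustment.

```python
# Crude stand-in for rater-effect correction: estimate each rater's severity as their mean
# deviation from the grand mean, then subtract that offset before aggregating per item.
from statistics import mean

# ratings[rater][item] -> score on a 1-5 scale (hypothetical data)
ratings = {
    "rater_1": {"resp_a": 4, "resp_b": 5, "resp_c": 3},
    "rater_2": {"resp_a": 2, "resp_b": 3, "resp_c": 1},   # harsher rater
    "rater_3": {"resp_a": 4, "resp_b": 4, "resp_c": 3},
}

grand_mean = mean(s for per_rater in ratings.values() for s in per_rater.values())
severity = {r: mean(per_rater.values()) - grand_mean for r, per_rater in ratings.items()}

items = {i for per_rater in ratings.values() for i in per_rater}
corrected = {
    item: mean(ratings[r][item] - severity[r] for r in ratings if item in ratings[r])
    for item in sorted(items)
}
print("estimated severity:", {r: round(s, 2) for r, s in severity.items()})
print("severity-corrected item scores:", {i: round(v, 2) for i, v in corrected.items()})
```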
Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory
Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucin...
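A toy illustration of trajectory-level rather than end-to-end checking: verify each intermediate claim against the evidence the agent cited. The step structure, the keyword-overlap test standing in for a real entailment or fact-verification model, and the threshold are all assumptions made to keep the sketch self-contained.

```python
# Illustrative trajectory audit: flag intermediate claims whose cited evidence does not
# appear to support them, instead of only grading the final report.
def claim_supported(claim, evidence, min_overlap=0.5):
    """Crude stand-in for an entailment check: fraction of claim keywords found in evidence."""
    keywords = {w.lower() for w in claim.split() if len(w) > 3}
    if not keywords:
        return True
    hits = sum(1 for w in keywords if w in evidence.lower())
    return hits / len(keywords) >= min_overlap

def audit_trajectory(steps):
    """steps: list of dicts with 'claim' and 'evidence' fields for each intermediate step."""
    return [{"step": i, "supported": claim_supported(s["claim"], s["evidence"]), "claim": s["claim"]}
            for i, s in enumerate(steps)]

trajectory = [
    {"claim": "GPU shipments grew in 2023",
     "evidence": "Industry report: GPU shipments grew strongly in 2023."},
    {"claim": "Datacenter demand tripled",
     "evidence": "The report discusses consumer graphics cards only."},
]
for row in audit_trajectory(trajectory):
    print(row)
```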
Agentified Assessment of Logical Reasoning Agents
We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessm...
GENIUS: Generative Fluid Intelligence Evaluation Suite
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accu...
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We intro...
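Sketching the counterfactual-intervention idea: edit one step of a rationale and test whether the final answer moves with it. The stand-in model and the sensitivity criterion below are assumptions for illustration, not RFEval's actual metric.

```python
# Minimal counterfactual-intervention check on a reasoning trace: a faithful rationale
# should make the answer sensitive to an edit of a load-bearing step.
def answer_with_rationale(question, rationale_steps):
    """Stand-in for a model that re-answers conditioned on a (possibly edited) rationale."""
    # Toy logic: the 'model' simply trusts whatever the last step concludes.
    return rationale_steps[-1].split()[-1]

def counterfactual_faithfulness(question, rationale_steps, step_idx, edited_step):
    original = answer_with_rationale(question, rationale_steps)
    edited = rationale_steps[:step_idx] + [edited_step] + rationale_steps[step_idx + 1:]
    counterfactual = answer_with_rationale(question, edited)
    return {"original": original, "counterfactual": counterfactual,
            "sensitive_to_intervention": original != counterfactual}

steps = ["12 apples shared by 4 people", "12 / 4 = 3", "each person gets 3"]
print(counterfactual_faithfulness("How many apples does each person get?", steps,
                                  step_idx=2, edited_step="each person gets 6"))
```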