State of the Field
The field of AI evaluation is currently focused on refining benchmarks to better assess model capabilities in real-world scenarios. Recent work highlights the need for more nuanced evaluation frameworks, such as those that account for implicit requirements in human communication or for the structural sensitivity of instruction-following tasks. For instance, new benchmarks like TSAQA and RIFT broaden the scope of evaluation beyond traditional tasks, addressing the complexities of time series analysis and the impact of prompt structure on model performance. Frameworks like Implicit Intelligence aim to measure how well AI agents can infer unstated user needs, while studies of LLM-as-judge setups reveal stable, systematic differences between evaluators, suggesting that judge models are not interchangeable. Together, these developments indicate a shift toward more comprehensive, context-aware evaluation methods, which could improve the reliability of AI systems in commercial applications such as automated content generation, decision support, and interactive agents.
Papers
SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation
Evaluating whether text-to-image models follow explicit spatial instructions is difficult to automate. Object detectors may miss targets or return multiple plausible detections, and simple geometric t...
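The entry above points at the core difficulty: detectors can miss objects or return several plausible boxes, so a hard geometric pass/fail is brittle. Below is a minimal sketch of what an uncertainty-aware spatial check could look like, assuming boxes in (x1, y1, x2, y2) format and a confidence-weighted "left of" relation; this is an illustration of the general idea, not the SpatialBench-UC protocol.

```python
# Illustrative only: a confidence-weighted check for "object A left of object B",
# marginalizing over multiple candidate detections instead of trusting the top-1 box.
# The relation, box format, and scoring rule are assumptions, not the benchmark's method.
from itertools import product

def center_x(box):
    x1, _, x2, _ = box
    return (x1 + x2) / 2.0

def p_left_of(dets_a, dets_b):
    """dets_*: list of (box, confidence) candidate detections for each object.

    Returns a confidence-weighted probability that A is left of B,
    rather than a hard pass/fail from single detections.
    """
    if not dets_a or not dets_b:
        return 0.0  # a missed detection contributes no evidence
    num, den = 0.0, 0.0
    for (box_a, conf_a), (box_b, conf_b) in product(dets_a, dets_b):
        w = conf_a * conf_b
        num += w * float(center_x(box_a) < center_x(box_b))
        den += w
    return num / den if den > 0 else 0.0

# Example: two candidate boxes for the cat, one for the dog.
cat = [((10, 20, 50, 80), 0.9), ((200, 30, 240, 90), 0.3)]
dog = [((120, 40, 180, 100), 0.8)]
print(f"P(cat left of dog) = {p_left_of(cat, dog):.2f}")
```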
Who can we trust? LLM-as-a-jury for Comparative Assessment
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely...
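As a rough illustration of the pairwise jury setup described above, the sketch below aggregates verdicts from several judge functions, querying each judge in both presentation orders to reduce position bias. The stand-in judges and the simple majority rule are assumptions for illustration, not necessarily the paper's aggregation method.

```python
# A minimal "LLM-as-a-jury" sketch for pairwise comparison. Each judge_fn stands in for
# a real LLM call and must return "A" or "B", meaning it prefers its first or second argument.
from collections import Counter

def jury_verdict(candidate_a, candidate_b, judge_fns):
    votes = []
    for judge in judge_fns:
        # Ask in both presentation orders; keep the vote only if the judge is order-consistent.
        first = judge(candidate_a, candidate_b)        # "A" or "B" in the original order
        swapped = judge(candidate_b, candidate_a)      # "A" here means the original B
        swapped = "A" if swapped == "B" else "B"       # map back to the original order
        if first == swapped:
            votes.append(first)
    tally = Counter(votes)
    if not tally or tally["A"] == tally["B"]:
        return "tie", tally
    return ("A" if tally["A"] > tally["B"] else "B"), tally

# Toy judges standing in for LLM calls: each simply prefers the longer answer.
judges = [lambda a, b: "A" if len(a) >= len(b) else "B"] * 3
print(jury_verdict("a detailed answer", "short", judges))
```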
Implicit Intelligence -- Evaluating Agents on What Users Don't Say
Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agenti...
TSAQA: Time Series Analysis Question And Answering Benchmark
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time ser...
Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior
LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 jud...
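One way the "consistent with themselves, not with each other" pattern could be probed is to compare each judge's repeat-run agreement against pairwise agreement between different judges on the same items. The sketch below uses hypothetical verdicts and plain percent agreement, not the paper's data or statistics.

```python
# Toy illustration: within-judge (test-retest) agreement vs between-judge agreement.
from itertools import combinations

def agreement(labels_x, labels_y):
    """Fraction of items on which two label sequences agree."""
    return sum(x == y for x, y in zip(labels_x, labels_y)) / len(labels_x)

# Hypothetical verdicts: two runs per judge over the same 6 items (1 = accept, 0 = reject).
runs = {
    "judge_a": ([1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1]),   # self-consistent
    "judge_b": ([0, 1, 1, 0, 0, 1], [0, 1, 1, 0, 1, 1]),   # mostly self-consistent
    "judge_c": ([1, 0, 0, 1, 1, 0], [1, 0, 0, 1, 1, 0]),   # self-consistent, different taste
}

intra = {j: agreement(r1, r2) for j, (r1, r2) in runs.items()}
inter = {f"{a} vs {b}": agreement(runs[a][0], runs[b][0])
         for a, b in combinations(runs, 2)}
print("within-judge agreement:", intra)
print("between-judge agreement:", inter)
```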
Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater model...
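To make the rater-effects idea concrete, here is a deliberately simplified severity correction on hypothetical ratings. Real IRT or many-facet Rasch models estimate rater, item, and response parameters jointly; this sketch only illustrates the direction of the adjustment.

```python
# Crude stand-in for rater-effect correction: estimate each rater's severity as their mean
# deviation from the grand mean, then subtract that offset before aggregating per item.
from statistics import mean

# ratings[rater][item] -> score on a 1-5 scale (hypothetical data)
ratings = {
    "rater_1": {"resp_a": 4, "resp_b": 5, "resp_c": 3},
    "rater_2": {"resp_a": 2, "resp_b": 3, "resp_c": 1},   # harsher rater
    "rater_3": {"resp_a": 4, "resp_b": 4, "resp_c": 3},
}

grand_mean = mean(s for per_rater in ratings.values() for s in per_rater.values())
severity = {r: mean(per_rater.values()) - grand_mean for r, per_rater in ratings.items()}

items = {i for per_rater in ratings.values() for i in per_rater}
corrected = {
    item: mean(ratings[r][item] - severity[r] for r in ratings if item in ratings[r])
    for item in sorted(items)
}
print("estimated severity:", {r: round(s, 2) for r, s in severity.items()})
print("severity-corrected item scores:", {i: round(v, 2) for i, v in corrected.items()})
```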
Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory
Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucin...
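A toy illustration of trajectory-level rather than end-to-end checking: verify each intermediate claim against the evidence the agent cited. The step structure, the keyword-overlap test standing in for a real entailment or fact-verification model, and the threshold are all assumptions made to keep the sketch self-contained.

```python
# Illustrative trajectory audit: flag intermediate claims whose cited evidence does not
# appear to support them, instead of only grading the final report.
def claim_supported(claim, evidence, min_overlap=0.5):
    """Crude stand-in for an entailment check: fraction of claim keywords found in evidence."""
    keywords = {w.lower() for w in claim.split() if len(w) > 3}
    if not keywords:
        return True
    hits = sum(1 for w in keywords if w in evidence.lower())
    return hits / len(keywords) >= min_overlap

def audit_trajectory(steps):
    """steps: list of dicts with 'claim' and 'evidence' fields for each intermediate step."""
    return [{"step": i, "supported": claim_supported(s["claim"], s["evidence"]), "claim": s["claim"]}
            for i, s in enumerate(steps)]

trajectory = [
    {"claim": "GPU shipments grew in 2023",
     "evidence": "Industry report: GPU shipments grew strongly in 2023."},
    {"claim": "Datacenter demand tripled",
     "evidence": "The report discusses consumer graphics cards only."},
]
for row in audit_trajectory(trajectory):
    print(row)
```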
Agentified Assessment of Logical Reasoning Agents
We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessm...
GENIUS: Generative Fluid Intelligence Evaluation Suite
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accu...
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We intro...
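Sketching the counterfactual-intervention idea: edit one step of a rationale and test whether the final answer moves with it. The stand-in model and the sensitivity criterion below are assumptions for illustration, not RFEval's actual metric.

```python
# Minimal counterfactual-intervention check on a reasoning trace: a faithful rationale
# should make the answer sensitive to an edit of a load-bearing step.
def answer_with_rationale(question, rationale_steps):
    """Stand-in for a model that re-answers conditioned on a (possibly edited) rationale."""
    # Toy logic: the 'model' simply trusts whatever the last step concludes.
    return rationale_steps[-1].split()[-1]

def counterfactual_faithfulness(question, rationale_steps, step_idx, edited_step):
    original = answer_with_rationale(question, rationale_steps)
    edited = rationale_steps[:step_idx] + [edited_step] + rationale_steps[step_idx + 1:]
    counterfactual = answer_with_rationale(question, edited)
    return {"original": original, "counterfactual": counterfactual,
            "sensitive_to_intervention": original != counterfactual}

steps = ["12 apples shared by 4 people", "12 / 4 = 3", "each person gets 3"]
print(counterfactual_faithfulness("How many apples does each person get?", steps,
                                  step_idx=2, edited_step="each person gets 6"))
```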