AI Benchmarking

10 papers
5.8 viability

State of the Field

Recent work in AI benchmarking centers on evaluating large language models (LLMs) across diverse real-world scenarios. New benchmarks such as DSAEval and AgentDrive tackle the complexities of data science and autonomous systems, respectively, with structured datasets that reflect the multifaceted nature of those fields. DSAEval tests LLMs on a wide array of data science tasks, finding strengths on structured data but persistent weaknesses in unstructured domains. AgentDrive contributes a comprehensive dataset of autonomous driving scenarios for training and assessing reasoning in dynamic environments. Gaia2 and PhysicsMind extend this push to asynchronous environments and physical reasoning, respectively. Collectively, these efforts aim to make models fit for practical use in automation, data analysis, and decision-making, while exposing capability gaps that call for further research and development.
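Despite their different domains, most of the benchmarks below share a common harness shape: a set of tasks with reference answers, a model under test, and a scoring function whose per-task results aggregate into a headline number. Here is a minimal sketch of that pattern, assuming a toy task schema; `Task`, `run_model`, and `exact_match` are illustrative names, not part of any benchmark listed here, and real benchmarks define far richer task formats and metrics.

```python
# Minimal sketch of a generic benchmark evaluation loop (illustrative only).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str     # task input shown to the model
    reference: str  # expected answer used for scoring

def exact_match(response: str, reference: str) -> float:
    """Score 1.0 on a case-insensitive exact match, else 0.0."""
    return float(response.strip().lower() == reference.strip().lower())

def evaluate(tasks: list[Task],
             run_model: Callable[[str], str],
             score: Callable[[str, str], float] = exact_match) -> float:
    """Run the model on every task and return the mean score."""
    total = sum(score(run_model(t.prompt), t.reference) for t in tasks)
    return total / len(tasks) if tasks else 0.0

if __name__ == "__main__":
    # Toy tasks standing in for a real benchmark split.
    tasks = [
        Task(prompt="2 + 2 = ?", reference="4"),
        Task(prompt="Capital of France?", reference="Paris"),
    ]
    # Stub model; in practice this would call an LLM API.
    print(f"accuracy: {evaluate(tasks, run_model=lambda p: '4'):.2f}")
```

Much of the variation across the papers below lives in what replaces `exact_match`: execution-based checks for data science code, simulation outcomes for driving scenarios, or another LLM acting as judge.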

Last updated Feb 27, 2026

Papers

Research Paper · Jan 20, 2026

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multip...

7.0 viability
Research Paper · Jan 23, 2026

AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems

The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However,...

7.0 viability
Research Paper · Jan 22, 2026

PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models

Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying p...

6.0 viability
Research Paper · Feb 25, 2026

ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimi...

6.0 viability
Research Paper · Jan 15, 2026

ARC Prize 2025: Technical Report

The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks, a core aspect of intelligence. The ARC Prize 2025 global competition targeted the newly released AR...

6.0 viability
Research Paper · Feb 12, 2026

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where env...

6.0 viability
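As a toy illustration of what asynchronous evaluation means here: in a static benchmark the world waits for the agent, while in a Gaia2-style setting the environment keeps evolving during the agent's deliberation, so slow reasoning can act on stale state. The sketch below is purely illustrative and assumes nothing about Gaia2's actual scenario format or APIs.

```python
# Illustrative only: an environment that mutates while the agent "thinks".
import asyncio

async def environment(state: dict) -> None:
    """Background process: the world changes regardless of the agent."""
    while state["tick"] < 5:
        await asyncio.sleep(0.1)
        state["tick"] += 1  # e.g., a message arrives, an event moves

async def agent(state: dict) -> None:
    """Observe, deliberate slowly, then act on possibly stale state."""
    observed = state["tick"]   # snapshot at observation time
    await asyncio.sleep(0.35)  # simulated reasoning latency
    print(f"observed tick {observed}, acted at tick {state['tick']}")

async def main() -> None:
    state = {"tick": 0}
    await asyncio.gather(environment(state), agent(state))

asyncio.run(main())
```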
Research Paper · Jan 29, 2026

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipeli...

5.0 viability
Research Paper · Feb 11, 2026

Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge

Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains chal...

5.0 viability
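For context on the setup this paper optimizes: an LLM-as-a-judge pipeline prompts a judge model with an instruction, a candidate response, and a rubric, then parses a numeric score for comparison against human ratings. The sketch below shows only that baseline pattern, with a stubbed `call_llm` and a placeholder rubric; it does not implement the paper's bi-level prompt optimization, and none of the names come from the paper.

```python
# Illustrative LLM-as-a-judge scoring call with a stubbed model client.
def call_llm(prompt: str) -> str:
    """Stub; replace with a real LLM client in practice."""
    return "4"

JUDGE_PROMPT = (
    "Rate the response to the instruction from 1 (poor) to 5 (excellent).\n"
    "Instruction: {instruction}\nResponse: {response}\nScore:"
)

def judge(instruction: str, response: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(instruction=instruction,
                                       response=response))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score

print(judge("Summarize the paper.", "It proposes a new benchmark."))
```

Approaches like the paper's presumably treat the judge's prompting itself as the object to optimize against human judgments, rather than keeping it fixed as here.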
Research Paper · Mar 3, 2026

Valet: A Standardized Testbed of Traditional Imperfect-Information Card Games

AI algorithms for imperfect-information games are typically compared using performance metrics on individual games, making it difficult to assess robustness across game choices. Card games are a natur...

5.0 viability
Research Paper · Feb 4, 2026 · B2B

SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by t...

5.0 viability