State of the Field
Current research in benchmarking increasingly targets gaps in the evaluation of machine learning and AI models across diverse applications, from data lakes to scientific reasoning. Recent work has introduced several comprehensive benchmarks that emphasize real-world relevance and task complexity. Benchmarks such as LakeMLB and OctoBench provide structured environments for assessing data-lake machine learning and scaffold-aware agentic coding, respectively, while GISA and FrontierScience target information-seeking and expert-level scientific reasoning. These efforts reflect a shift toward more nuanced evaluation criteria that go beyond traditional metrics, for example the memory-driven reasoning mechanisms probed by $A^3$-Bench. Meanwhile, benchmarks such as AstroReason-Bench and XCR-Bench evaluate agentic planning and cultural reasoning, respectively, and reveal significant performance gaps in existing models. This trend points to a growing recognition that standardized, contextually rich benchmarks are needed to inform the development of robust, application-ready AI systems.
Papers (1–10 of 21)
LakeMLB: Data Lake Machine Learning Benchmark
Modern data lakes have emerged as foundational platforms for large-scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table-oriented abstractions....
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and pe...
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domai...
ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design
While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in rea...
$A^3$-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation
Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency a...
GISA: A Benchmark for General Information-Seeking Assistant
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Variou...
Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games
The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently ...
CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching
As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious corre...
AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely foc...
XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLMs
Cross-cultural competence in large language models (LLMs) requires the ability to identify Culture-Specific Items (CSIs) and to adapt them appropriately across cultural contexts. Progress in evaluatin...