State of the Field
Recent advances in benchmarking tools for large language models (LLMs) address critical gaps in evaluating their performance across diverse software engineering tasks. New benchmarks such as BIRD-Python and AdaptEval assess LLMs' abilities to translate natural language into executable code and to adapt existing code snippets, respectively. These tools emphasize context and procedural logic, showing that performance often hinges on a model's grasp of user intent and domain-specific knowledge. TSRBench, meanwhile, introduces a comprehensive framework for evaluating time series reasoning, highlighting the need for models to integrate multimodal inputs effectively. BEHELM aims to unify evaluation metrics and datasets, enabling a more robust assessment of LLMs in real-world applications. Collectively, these efforts reflect a shift toward more nuanced and practical evaluation methods, which could improve the reliability and usability of LLMs in commercial contexts ranging from data analysis to software development.
Papers
Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity
While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python or Pandas to...
TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is ...
Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering
Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, in...
DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity
We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. These tasks, which span 10 domains and draw on information sources from 40 countri...
AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation
Recent advancements in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. However, for adaptation, a critical acti...