AI Evaluation
Papers in AI Evaluation (10 papers)
- Agent-as-a-Judge (Viability: 3.0)
  A survey paper exploring 'Agent-as-a-Judge' in AI evaluation and proposing a developmental taxonomy.
- NoReGeo: Non-Reasoning Geometry Benchmark (Viability: 4.0)
  NoReGeo offers a benchmark for evaluating LLMs' intrinsic geometric understanding, highlighting gaps in current models.
- SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation (Viability: 8.0)
  SpatialBench-UC provides a benchmark and tooling for validating the spatial accuracy of text-to-image models, supporting more rigorous evaluation and optimization of model performance.
- Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks (Viability: 5.0)
  Develop a benchmark suite for testing LLMs' reasoning on advanced theory-of-computation tasks.
- When Wording Steers the Evaluation: Framing Bias in LLM judges (Viability: 4.0)
  Develop protocols to mitigate framing bias in LLM-based evaluations for more reliable AI judgments (see the sketch after this list).
- Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats (Viability: 3.0)
  Develop a comprehensive evaluation framework for testing AI agents across languages, cultures, and security challenges.
- Improving Methodologies for LLM Evaluations Across Global Languages (Viability: 3.0)
  Create a shared framework for evaluating AI model safety across global languages, drawing on methodological insights.
- Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior (Viability: 6.0)
  Develop an AI system that identifies and leverages stable evaluative fingerprints in LLM judges to improve the consistency of AI assessments.
- AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms? (Viability: 5.0)
  A benchmark tool for assessing Large Reasoning Models' algorithmic understanding.
- RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures (Viability: 5.0)
  RIFT evaluates LLMs' handling of reordered instructions, revealing challenges in non-sequential workflows.
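
As a rough illustration of the framing-bias concern raised in 'When Wording Steers the Evaluation', the sketch below scores the same answer under a positively and a negatively framed rubric and reports the discrepancy. This is a minimal sketch, not taken from the paper: the rubrics, the framing_gap helper, and the stubbed call_judge function are all assumptions standing in for a real LLM judge call.

```python
import random

# Illustrative only: a toy probe for framing bias in an LLM judge.
# call_judge is a stand-in for a real LLM judge call (e.g. an API request
# returning a 1-10 score); here it is stubbed with a seeded random number
# so the sketch runs end to end.

POSITIVE_RUBRIC = "Rate how helpful and correct this answer is, from 1 to 10."
NEGATIVE_RUBRIC = "Rate how unhelpful and error-prone this answer is, from 1 to 10."


def call_judge(rubric: str, answer: str) -> float:
    """Placeholder judge: replace with an actual LLM call returning a 1-10 score."""
    rng = random.Random(hash((rubric, answer)))
    return rng.uniform(1.0, 10.0)


def framing_gap(answer: str) -> float:
    """Score the same answer under both framings and return the discrepancy.

    The negatively framed score is inverted (11 - score) so that both numbers
    mean 'higher is better'; a judge free of framing bias should give a gap
    close to zero.
    """
    positive = call_judge(POSITIVE_RUBRIC, answer)
    negative = 11.0 - call_judge(NEGATIVE_RUBRIC, answer)
    return positive - negative


if __name__ == "__main__":
    answers = [
        "The capital of France is Paris.",
        "Binary search runs in O(n log n) time.",  # deliberately wrong claim
    ]
    gaps = [framing_gap(a) for a in answers]
    print(f"Mean framing gap: {sum(gaps) / len(gaps):+.2f}")
```

A probe like this can motivate one simple mitigation protocol: score each item under multiple framings and average, rather than trusting a single wording of the rubric.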