Benchmark Development

Trending

4 papers

5.0 viability

+100% (30d)

Papers

Research Paper·Mar 4, 2026

Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow ...

5.0 viability

Research Paper·Feb 16, 2026

MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs

AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Ex...

5.0 viability

Research Paper·Feb 23, 2026

Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes dete...

5.0 viability

Research Paper·Feb 26, 2026

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

As LLMs have achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexi...

5.0 viability
