State of Benchmark Development

4 papers · avg viability 5.0

Top papers

Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions (5.0)
MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs (5.0)
Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning (5.0)
SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy (5.0)