Papers
1–4 of 4Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions
Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow ...
MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs
AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Ex...
Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning
Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes dete...
SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
As LLMs achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexi...