Top papers
- Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions (5.0)
- MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs (5.0)
- Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning (5.0)
- SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy (5.0)