Benchmark Development Comparison Hub
4 papers - avg viability 5.0
Top Papers
- Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions (5.0)
Develops the RealPref benchmark for evaluating LLMs on personalized preference-following tasks.
- MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs (5.0)
Develops MATEO, a multimodal benchmark for assessing temporal reasoning and planning in large vision-language models.
- Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning(5.0)
A benchmark for evaluating LLM reasoning in naturalistic contexts developed from a detective tabletop game.
- SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy (5.0)
Develops SPM-Bench, a multimodal benchmark for LLMs in scanning probe microscopy, to assess AI reasoning in specialized scientific domains.