Benchmarking LLMs Comparison Hub
3 papers - avg viability 6.3
Top Papers
- TopoBench: Benchmarking LLMs on Hard Topological Reasoning (8.0)
TopoBench is a benchmark for evaluating LLMs on challenging topological reasoning tasks.
- CCTU: A Benchmark for Tool Use under Complex Constraints (7.0)
CCTU is a benchmark that evaluates large language models' tool use under complex constraints, revealing critical limitations in their performance.
- GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms (4.0)
GAIN is a benchmark that evaluates large language models' goal-aligned decision-making in real-world business contexts.