Benchmarking LLMs Comparison Hub
3 papers - avg viability 6.3
Top Papers
- TopoBench: Benchmarking LLMs on Hard Topological Reasoning (8.0)
TopoBench is a benchmark for evaluating LLMs on challenging topological reasoning tasks.
- CCTU: A Benchmark for Tool Use under Complex Constraints (7.0)
CCTU is a benchmark that evaluates large language models' tool use under complex constraints, revealing critical limitations in their performance.
- GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms (4.0)
GAIN is a benchmark that evaluates large language models' goal-aligned decision-making in real-world business contexts.