Papers
1–3 of 3

Research Paper · Mar 12, 2026
TopoBench: Benchmarking LLMs on Hard Topological Reasoning
Solving topological grid puzzles requires reasoning over global spatial invariants, such as connectivity, loop closure, and region symmetry, and remains challenging for even the most powerful large language models...
8.0 viability
Research Paper · Mar 16, 2026
CCTU: A Benchmark for Tool Use under Complex Constraints
Solving problems through tool use under explicit constraints is a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling...
7.0 viability · Has code
Research Paper · Mar 19, 2026
GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms
We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing be...
4.0 viability