TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Estimated build cost: $10K-$14K over 6-10 weeks.

Founder's Pitch

"TopoBench is a benchmark for evaluating LLMs on challenging topological reasoning tasks."

Benchmarking LLMs · Score: 8

Commercial Viability Breakdown

Scores are on a 0-10 scale.

High Potential: 5 (2/4 signals)
Quick Build: 7.5 (3/4 signals)
Series A Potential: 2.5 (1/4 signals)

Sources used for this analysis

arXiv Paper: Full-text PDF analysis of the research paper
GitHub Repository: Code availability, stars, and contributor activity
Citation Network: Semantic Scholar citations and co-citation patterns
Community Predictions: Crowd-sourced unicorn probability assessments

Analysis model: GPT-4o · Last scored: 3/12/2026
