
BUILDER'S SANDBOX

Build This Paper

Use an AI coding agent to implement this research.

OpenAI Codex (AI Agent): Lightweight coding agent in your terminal.
Claude Code (AI Agent): Agentic coding tool for terminal workflows.
AntiGravity IDE (Scaffolding): AI agent mindset installer and workflow scaffolder.
Cursor (IDE): AI-first code editor built on VS Code.
VS Code (IDE): Free, open-source editor by Microsoft.

Estimated $9K - $13K over 6-10 weeks.



Founder's Pitch

"LLM-Wikirace provides a benchmarking tool for evaluating LLMs on long-term planning and reasoning tasks using Wikipedia's knowledge graph."
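The pitch describes a Wikirace-style setup: an agent must navigate from a start article to a target article by following links in Wikipedia's graph. The paper's actual scoring rule is not shown on this page, but a minimal sketch of one plausible metric, comparing the agent's path length against the BFS-optimal path on a toy link graph, could look like this (the `efficiency_score` function and the toy graph are illustrative assumptions, not the benchmark's real API):

```python
from collections import deque

def shortest_path_length(graph, start, goal):
    """BFS shortest-path length (in hops) in a directed link graph."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable from start

def efficiency_score(graph, start, goal, agent_path):
    """Hypothetical metric: optimal hops / agent hops (1.0 = optimal,
    0.0 if the agent never reached the goal)."""
    optimal = shortest_path_length(graph, start, goal)
    if optimal is None or len(agent_path) < 2 or agent_path[-1] != goal:
        return 0.0
    return optimal / (len(agent_path) - 1)

# Toy directed graph standing in for Wikipedia article links.
graph = {
    "Dog": ["Mammal", "Wolf"],
    "Wolf": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": ["Biology"],
}

# An agent that detours through "Wolf": 3 hops vs. the optimal 2.
path = ["Dog", "Wolf", "Mammal", "Animal"]
print(efficiency_score(graph, "Dog", "Animal", path))  # → 0.666...
```

In a real harness, `graph.get(node, [])` would be replaced by live link extraction from Wikipedia pages, and the agent's next hop would come from an LLM choosing among the links on the current page.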

Category: Benchmarking · Score: 4

Commercial Viability Breakdown (0-10 scale)

High Potential: 2.5 (1/4 signals)
Quick Build: 5 (2/4 signals)
Series A Potential: 0 (0/4 signals)

Sources used for this analysis

arXiv Paper: Full-text PDF analysis of the research paper
GitHub Repository: Code availability, stars, and contributor activity
Citation Network: Semantic Scholar citations and co-citation patterns
Community Predictions: Crowd-sourced unicorn probability assessments

Analysis model: GPT-4o · Last scored: 2/18/2026
