Papers
CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are...
Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems...
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target...
SourceBench: Can AI Answers Reference Quality Web Sources?
Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark...