Papers
1–2 of 2
Research Paper·Feb 5, 2026·B2B·Media & Entertainment
GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematica...
7.0 viability
Research Paper·Feb 3, 2026
Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals
Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. ...
5.0 viability