Evaluation Frameworks Comparison Hub
3 papers - avg viability 5.7
Top Papers
- GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks? (7.0)
GenArena offers a human-aligned evaluation framework for visual generation tasks, significantly improving accuracy and simplifying model benchmarking.
- Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals (5.0)
This paper develops a statistically efficient evaluation framework for LLM mathematical reasoning that uses pairwise comparison signals to improve ranking accuracy.