AI Model Evaluation Comparison Hub
3 papers - avg viability 5.0
Top Papers
- Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry (5.0)
INSPECTOR uses small language model representations for efficient, interpretable evaluation, challenging the dominance of LLMs in evaluative tasks.
- STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction (5.0)
STAR combines statistical and agentic reasoning to predict model performance, with significant accuracy improvements.