AI Model Evaluation Comparison Hub

3 papers - avg viability 5.0

Reference Surfaces

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry (5.0)
INSPECTOR uses small language model representations for efficient, interpretable evaluation, challenging the dominance of LLMs in evaluative tasks.
STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction (5.0)
STAR improves the accuracy of model performance prediction by combining statistical and agentic reasoning.