AI Evaluation Tools Comparison Hub

3 papers - avg viability 6.3

Reference Surfaces

Automating Forecasting Question Generation and Resolution for AI Evaluation (7.0)
Automated AI system for scalable forecasting question generation and resolution with high accuracy.
What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding (5.0)
Develop a benchmarking tool to assess LLM agents' environment understanding using the Task-to-Quiz paradigm.