LLM Evaluation
Papers in LLM Evaluation
10 papers
- SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
  A practical benchmark for measuring LLMs' real-world sub-token understanding, which matters for applications such as text-based map navigation.
  Viability: 5.0

- DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
  A platform for detecting and reducing hallucinations in Vietnamese language models, built on the ViHallu dataset and structured prompting techniques.
  Viability: 6.0

- Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess
  Uses chess to disentangle language models' crystallized capacity (memorized knowledge) from their fluid capacity (on-the-fly reasoning).
  Viability: 3.0

- Can LLMs interpret figurative language as humans do?: surface-level vs representational similarity
  Compares LLM and human interpretation of figurative language at the surface and representational levels, with a view to better model alignment.
  Viability: 3.0

- Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
  An adaptive evaluation procedure that ranks models reliably while scoring as few items as possible; one such stopping rule is sketched after this list.
  Viability: 6.0

- Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora
  Translation-aware contamination detection for fairer evaluation of multilingual LLMs, motivated by evidence from Arabic corpora that translation can hide benchmark leakage; see the overlap-check sketch below.
  Viability: 4.0

- ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
  Identifies and categorizes LLM error types to support debugging and model selection.
  Viability: 6.0

- Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge
  Introduces TRACK, a benchmark for evaluating LLMs' multi-step reasoning under conflicting knowledge updates.
  Viability: 5.0

- PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation
  A statistical framework that combines a small set of human annotations with many LLM-judge scores to evaluate search and ranking systems with less bias and fewer annotations; see the prediction-powered sketch below.
  Viability: 4.0

- Even GPT-5.2 Can't Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs
  A tool for measuring the error limits of language models before deployment in safety-critical domains; see the last sketch below.
  Viability: 5.0
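
The adaptive-evaluation entry can be made concrete with a generic sequential stopping rule. The paper's actual procedure is not described in its summary, so the following is only a minimal sketch under common assumptions: paired per-item scores for two models, a normal-approximation confidence interval, and a stop as soon as that interval excludes zero. All names here (adaptive_rank, scores_a, scores_b) are illustrative, not the paper's API.

```python
import numpy as np

def adaptive_rank(scores_a, scores_b, z=2.58, min_items=30):
    """Consume paired per-item scores until one model is confidently ahead.

    Stops as soon as the z-sigma normal confidence interval for the mean
    score difference excludes zero; returns the winner and items used.
    """
    diffs = []
    for a, b in zip(scores_a, scores_b):
        diffs.append(a - b)
        n = len(diffs)
        if n < min_items:
            continue  # don't trust the variance estimate on tiny samples
        mean = float(np.mean(diffs))
        half_width = z * np.std(diffs, ddof=1) / np.sqrt(n)
        if abs(mean) > half_width:
            return ("A" if mean > 0 else "B"), n
    return None, len(diffs)  # budget exhausted with no confident winner
```

Note that checking the interval after every item inflates the false-positive rate (the classic peeking problem), so a real adaptive scheme would use an anytime-valid confidence sequence rather than a fixed-z interval.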
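
For the translation-based contamination entry, the core operation is checking n-gram overlap between benchmark items and a training corpus after translating the items into the corpus language. A toy sketch, assuming a caller-supplied translate function (hypothetical) and a prebuilt set of corpus n-grams:

```python
def ngram_set(text, n=8):
    """All word n-grams of a text, as a set of tuples."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def translated_overlap(item, corpus_ngrams, translate, n=8):
    """Fraction of the translated item's n-grams that appear in the corpus.

    `translate` is a hypothetical callable mapping the benchmark item into
    the corpus language; high overlap suggests the item (or a translation
    of it) leaked into the training data.
    """
    grams = ngram_set(translate(item), n)
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)
```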
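
PRECISE's exact estimator is not given in its blurb, but "prediction-powered" points at the standard prediction-powered inference recipe: score everything with the LLM judge, then debias using the small human-labeled slice. A minimal sketch of that baseline estimator; the names and data layout are assumptions, not the paper's interface:

```python
import numpy as np

def ppi_mean(human_labeled, judge_labeled, judge_unlabeled):
    """Prediction-powered estimate of the mean human score for one system.

    human_labeled   -- human scores on a small labeled sample
    judge_labeled   -- LLM-judge scores on the same labeled items
    judge_unlabeled -- LLM-judge scores on the large unlabeled pool
    """
    # The rectifier corrects the judge's systematic bias, measured
    # where human ground truth is available.
    rectifier = np.mean(human_labeled) - np.mean(judge_labeled)
    return float(np.mean(judge_unlabeled) + rectifier)
```

Ranking systems then amounts to comparing these debiased means; how much annotation the approach saves depends on how closely the judge tracks the human scores.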
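
Finally, one plausible reading of a "zero-error horizon" is the largest task size at which a model makes no errors across repeated trials. The blurb does not define the metric, so this sketch is an assumption; is_correct stands in for whatever pass/fail harness the task provides.

```python
def zero_error_horizon(is_correct, sizes, trials=50):
    """Largest size n (scanning `sizes` in increasing order) for which the
    model is correct on every one of `trials` attempts; stops at the first
    size that shows any error."""
    horizon = 0
    for n in sizes:
        if all(is_correct(n) for _ in range(trials)):
            horizon = n
        else:
            break
    return horizon
```

With a finite number of trials the horizon is only an upper-confidence statement, not a guarantee of error-free behavior at or below that size.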