Papers
Research Paper·Feb 19, 2026
ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmat...
6.0 viability
Research Paper·Mar 5, 2026
MPCEval: A Benchmark for Multi-Party Conversation Generation
Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compa...
6.0 viability
Research Paper·Mar 2, 2026
When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation
Topic models uncover latent thematic structures in text corpora, yet evaluating their quality remains challenging, particularly in specialized domains. Existing methods often rely on automated metrics...
5.0 viability