Model Interpretability Comparison Hub
4 papers - avg viability 4.3
Current research in model interpretability increasingly focuses on overcoming the limitations of black-box systems, particularly in visual recognition and tabular data analysis. Recent work introduces frameworks such as UNBOX, which enables class-wise model dissection without internal access, improving trust in visual models by surfacing implicit biases and learned concepts. Similarly, ExplainerPFN offers a zero-shot method for estimating feature importance in tabular datasets, giving practitioners insight without direct model access and easing scalability concerns. For language models, DLM-Scope applies sparse autoencoders to mechanistic interpretability, while Concept Influence attributes model behavior to training data more efficiently. Collectively, these advances point toward more accessible and scalable interpretability tools that could improve accountability and performance in real-world applications across domains such as healthcare, finance, and automated systems.
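To make the "model-free" idea concrete, here is a minimal sketch: without access to any trained model, tabular features can still be ranked by a simple statistic such as absolute Pearson correlation with the target. This is only an illustrative stand-in; ExplainerPFN itself uses a network pretrained on synthetic datasets, and the function names below are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def model_free_importance(features, target):
    """Score each feature column by |correlation| with the target -- no model needed."""
    return [abs(pearson(col, target)) for col in features]

# Toy data: feature 0 drives the target, feature 1 is pure noise.
random.seed(0)
x0 = [random.random() for _ in range(200)]
x1 = [random.random() for _ in range(200)]
y = [2.0 * a + 0.3 for a in x0]

scores = model_free_importance([x0, x1], y)
```

On this toy data the informative feature receives a score near 1 and the noise feature a score near 0, which is the qualitative behavior a model-free importance estimator should exhibit.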
Top Papers
- ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations (5.0)
ExplainerPFN provides model-free, zero-shot feature importance estimates for tabular data using models pretrained on synthetic datasets.
- DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders (5.0)
Develops an interpretability framework for Diffusion Language Models that uses sparse autoencoders to enhance feature extraction and intervention capabilities.
- Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution (5.0)
Uses Concept Influence to efficiently attribute language model behaviors to training data, improving model control.
- The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models (2.0)
Explores the geometric structure of correctness representations in language models to understand internal error-detection mechanisms.
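As an illustration of the sparse-autoencoder technique that DLM-Scope builds on, the sketch below trains a one-layer ReLU autoencoder with an L1 penalty on toy "activation" vectors. It is a generic sketch and assumes nothing about DLM-Scope's actual architecture, data, or training setup; all sizes and hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "activations": 256 samples in 8 dims, generated from 3 sparse latent directions.
directions = rng.normal(size=(3, 8))
codes = rng.random((256, 3)) * (rng.random((256, 3)) < 0.3)  # mostly-zero codes
X = codes @ directions
n, d, k = 256, 8, 16          # samples, input dim, (overcomplete) dictionary size
lam, lr = 1e-3, 0.1           # L1 strength, learning rate

We = rng.normal(scale=0.1, size=(d, k)); be = np.zeros(k)  # encoder
Wd = rng.normal(scale=0.1, size=(k, d)); bd = np.zeros(d)  # decoder

def forward(X):
    H = np.maximum(X @ We + be, 0.0)   # sparse ReLU code
    return H, H @ Wd + bd              # code, reconstruction

def loss(X, H, Xhat):
    return np.mean((X - Xhat) ** 2) + lam * np.mean(H)  # MSE + L1 (H >= 0)

H, Xhat = forward(X)
loss0 = loss(X, H, Xhat)

for _ in range(500):
    H, Xhat = forward(X)
    dXhat = 2.0 * (Xhat - X) / (n * d)                  # grad of mean squared error
    dH = (dXhat @ Wd.T + lam / (n * k)) * (H > 0)       # backprop through ReLU + L1
    Wd -= lr * (H.T @ dXhat); bd -= lr * dXhat.sum(0)
    We -= lr * (X.T @ dH);    be -= lr * dH.sum(0)

H, Xhat = forward(X)
loss1 = loss(X, H, Xhat)
```

After training, the combined reconstruction-plus-sparsity loss drops below its initial value, and the rows of `Wd` play the role of interpretable dictionary directions; in an actual mechanistic-interpretability pipeline, `X` would be residual-stream activations rather than synthetic vectors.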