AI Interpretability Comparison Hub
7 papers - avg viability 4.6
Current research in AI interpretability increasingly focuses on making complex models transparent and understandable, particularly in high-stakes applications. Recent work emphasizes frameworks for more adaptable, user-friendly interpretation, such as Mixture of Concept Bottleneck Experts, which balances accuracy and interpretability by routing predictions through multiple expert models. Techniques like AgentXRay reconstruct the workflows of agentic systems, making previously opaque processes accessible to users, while Rationale Extraction with Knowledge Distillation distills interpretable rationales from deep neural networks to improve their usability. Meanwhile, approaches that use representational geometry to predict generalization failures, and Certified Circuits for stable circuit discovery, address the reliability of interpretability methods themselves. Collectively, these advances point toward robust, clinically informed frameworks that not only diagnose but also treat issues within AI models, fostering a deeper understanding of their internal workings.
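The "Mixture of Concept Bottleneck Experts" idea combines two familiar ingredients: concept bottleneck models (inputs are mapped to human-interpretable concepts, and the label depends on the input only through those concepts) and mixture-of-experts gating. As a minimal sketch of that general pattern only — not the paper's implementation; all dimensions, weight names, and the random initialization here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_features, n_concepts, n_classes, n_experts = 8, 3, 2, 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Each expert is its own concept bottleneck: input -> concepts -> label,
# so every expert's contribution is routed through inspectable concepts.
experts = [
    {
        "W_concept": rng.normal(size=(n_features, n_concepts)),
        "W_label": rng.normal(size=(n_concepts, n_classes)),
    }
    for _ in range(n_experts)
]
W_gate = rng.normal(size=(n_features, n_experts))

def predict(x):
    gate = softmax(x @ W_gate)  # soft assignment of the input to experts
    logits = np.zeros(n_classes)
    concepts_per_expert = []
    for g, e in zip(gate, experts):
        concepts = sigmoid(x @ e["W_concept"])  # interpretable bottleneck
        concepts_per_expert.append(concepts)
        logits += g * (concepts @ e["W_label"])  # label uses only concepts
    return gate, concepts_per_expert, logits

x = rng.normal(size=n_features)
gate, concepts, logits = predict(x)
print(gate.shape, logits.shape)
```

Inspecting `gate` shows which expert handled the input, and each expert's `concepts` vector explains its vote, which is the interpretability/adaptability trade-off the overview describes.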
Top Papers
- Mixture of Concept Bottleneck Experts (6.0)
Develop a flexible framework using Mixture of Concept Bottleneck Experts to enhance model interpretability and adaptability for specific user needs.
- AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction (5.0)
AgentXRay transforms opaque agentic systems into interpretable workflows using a search-based optimization framework.
- Learn from A Rationalist: Distilling Intermediate Interpretable Rationales (5.0)
REKD (Rationale Extraction with Knowledge Distillation) improves student DNN interpretability by distilling intermediate interpretable rationales, supporting better AI decision-making.
- Diagnosing Generalization Failures from Representational Geometry Markers (5.0)
Develop a predictive tool for diagnosing AI generalization failures using representational geometry markers.
- Certified Circuits: Stability Guarantees for Mechanistic Circuits (5.0)
Develop a tool for mechanistic interpretability that provides stability guarantees for neural circuit discovery.
- Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models (4.0)
Develop a clinical framework for diagnosing and treating AI model disorders with Model Medicine.
- Step-resolved data attribution for looped transformers (2.0)
Step-Decomposed Influence provides per-step data attribution insights for looped transformers.