State of the Field
Recent advances in AI interpretability focus on making complex models more transparent and usable, addressing commercial challenges in sectors such as healthcare and finance. New frameworks such as Mixture of Concept Bottleneck Experts enable adaptable interpretability by combining multiple expert models, balancing accuracy against user-specific needs. Concurrently, methods such as rationale extraction with knowledge distillation improve the interpretability of deep neural networks by letting them learn from more capable models, boosting performance in high-stakes applications. AgentXRay offers a novel approach to reconstructing the workflows of agentic systems, making them interpretable without access to internal parameters. Meanwhile, techniques that leverage representational geometry provide predictive signals for generalization failures, which can inform model selection and deployment strategies. Collectively, these developments signal a shift toward more robust, user-friendly interpretability solutions for the complexities of real-world applications.
Papers
Mixture of Concept Bottleneck Experts
Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically fix their task predictor to a single linear or Boo...
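The core CBM idea, predictions grounded in concept scores rather than raw features, can be sketched minimally. The dimensions, weights, and concept names below are illustrative assumptions, not taken from the paper; the key property is that the task head reads only the concept bottleneck.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4 input features, 3 human-readable concepts, 2 classes.
W_concept = rng.normal(size=(3, 4))   # concept predictor (here a linear map)
W_task = rng.normal(size=(2, 3))      # single linear task predictor over concepts

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    # Bottleneck: the prediction depends on x only through the concept scores c,
    # so each output can be explained in terms of the concepts it weights.
    c = sigmoid(W_concept @ x)        # e.g. "has wings", "has beak", "flies"
    logits = W_task @ c               # task head sees concepts, never raw features
    return c, logits

c, logits = predict(rng.normal(size=4))
print(c.shape, logits.shape)  # (3,) (2,)
```

The fixed single linear head above is exactly the limitation the paper's mixture-of-experts design is said to relax.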
Learn from A Rationalist: Distilling Intermediate Interpretable Rationales
Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction ...
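The distillation component referenced here follows the standard soft-label objective: a student is trained to match a more capable teacher's softened output distribution. This is a generic sketch of that objective, not the paper's specific rationale-extraction pipeline; the temperature value and logits are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T  # temperature T softens the distribution
    z -= z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between teacher and student softened distributions,
    # scaled by T^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

# The loss is zero when student matches teacher, positive otherwise.
loss = distill_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0])
print(loss > 0.0)  # True
```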
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some framewo...
Diagnosing Generalization Failures from Representational Geometry Markers
Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventi...
Certified Circuits: Stability Guarantees for Mechanistic Circuits
Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal sub...
Step-resolved data attribution for looped transformers
We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for $\tau$ recurrent iterations to enable latent reasoning. Existing train...
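The weight-sharing structure of a looped transformer can be illustrated with a toy stand-in: one parameter block applied $\tau$ times. The linear-plus-tanh block, sizes, and loop count below are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 8, 4  # hypothetical hidden size and number of recurrent iterations

# A single shared weight matrix stands in for the shared transformer block.
W = rng.normal(size=(d, d)) / np.sqrt(d)

def looped_forward(x, tau):
    # The same block (here a linear map + tanh) is applied tau times, so
    # every iteration reuses identical parameters; attribution must then
    # resolve a training example's influence per step, not per layer.
    h = x
    for _ in range(tau):
        h = np.tanh(W @ h)
    return h

h = looped_forward(rng.normal(size=d), tau)
print(h.shape)  # (8,)
```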