Papers
ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations
Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require ...
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable feat...
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attri...
The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models
When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture fam...