Papers
Research Paper·Feb 11, 2026
Retrieval-Aware Distillation for Transformer-SSM Hybrids
State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. Prior work links this gap to a small set of attention heads, te...
6.0 viability
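For orientation only: the abstract frames the method as distillation from a Transformer teacher into an SSM student. The excerpt does not specify the actual objective, so the sketch below is only a generic hidden-state distillation loss; the projection layer and layer pairing are assumptions, not the paper's retrieval-aware construction.

```python
import torch.nn.functional as F

def hidden_state_distill_loss(student_h, teacher_h, proj):
    """Generic layerwise distillation: MSE between projected student hidden
    states and frozen teacher hidden states.

    student_h: (batch, seq_len, d_student), teacher_h: (batch, seq_len, d_teacher)
    proj: an nn.Linear mapping d_student -> d_teacher (an assumption; the
    paper's actual retrieval-aware objective is not given in the excerpt).
    """
    return F.mse_loss(proj(student_h), teacher_h.detach())
```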
Research Paper·Feb 5, 2026·B2B
ZeroS: Zero-Sum Linear Attention for Efficient Transformers
Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction...
5.0 viability
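As background for the $O(N)$ claim: standard linear attention replaces softmax with a kernel feature map $\phi$ and exploits associativity, computing $\phi(Q)\,(\phi(K)^\top V)$ instead of $(\phi(Q)\phi(K)^\top)V$, so the $N \times N$ attention matrix is never materialized. A minimal NumPy sketch of that generic trick (not ZeroS itself, whose zero-sum construction the excerpt does not specify):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Generic (non-causal) linear attention in O(N * d^2) time.

    Q, K: (N, d); V: (N, d_v). phi is an arbitrary positive feature map
    (ReLU plus epsilon here, a common choice, not ZeroS's).
    """
    Qf, Kf = phi(Q), phi(K)                    # feature-mapped queries/keys
    KV = Kf.T @ V                              # (d, d_v): key/value summary
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T   # (N, 1): per-query normalizer
    return (Qf @ KV) / Z                       # (N, d_v), no N x N matrix
```

A causal variant replaces the global summary `KV` with a running prefix sum over positions, which preserves the linear complexity.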
Research Paper·Feb 5, 2026·B2B
Shiva-DiT: Residual-Based Differentiable Top-$k$ Selection for Efficient Diffusion Transformers
Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention. Existing pruning methods fail to simultaneously satisfy differentiability, efficienc...
5.0 viability
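For context on the selection problem the abstract names: hard top-$k$ is not differentiable, and a common generic workaround is a straight-through estimator that uses the exact top-$k$ mask in the forward pass and a soft surrogate for gradients. A PyTorch sketch of that standard trick (not Shiva-DiT's residual-based selector, which the excerpt does not detail; the temperature `tau` is an assumption):

```python
import torch

def straight_through_topk_mask(scores: torch.Tensor, k: int, tau: float = 1.0):
    """Top-k token mask with straight-through gradients.

    scores: (batch, n_tokens) importance scores. Returns a (batch, n_tokens)
    mask that is hard {0, 1} in the forward pass but backpropagates through
    a sigmoid relaxation. Ties with the k-th score may select extra tokens.
    """
    kth = scores.topk(k, dim=-1).values[..., -1:]   # (batch, 1): k-th largest
    soft = torch.sigmoid((scores - kth) / tau)      # differentiable surrogate
    hard = (scores >= kth).float()                  # exact top-k mask
    return hard + (soft - soft.detach())            # straight-through estimator
```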