Model Compression

7 papers
5.3 average viability
-25% over 30 days

State of the Field

Current research in model compression focuses on improving the efficiency of large language models (LLMs) while preserving their performance. Recent work introduces techniques such as agent-guided pruning, which selects which layers to prune so that critical knowledge pathways are preserved, and family-aware quantization, which generates high-fidelity calibration data to reduce the accuracy loss incurred during quantization. These advances address the pressing commercial need to deploy LLMs on resource-constrained devices without sacrificing accuracy. In addition, Hessian Robust Quantization aims to stabilize low-bit quantization by reshaping the loss landscape, while quantization-aware unlearning tackles the removal of sensitive information from models without full retraining. The field is shifting toward adaptive, intelligent strategies that not only compress models but also safeguard their reliability and integrity in real-world use, a sign that approaches to model efficiency are maturing.
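To ground the loss-landscape claim, the standard second-order view of quantization (general background, not any one paper's formulation) is sketched below; Q denotes the quantizer and H the Hessian of the task loss.

```latex
% Second-order Taylor expansion of the task loss L around trained weights w^*,
% with quantization perturbation \Delta w = Q(w^*) - w^*. Near a converged
% minimum the gradient term is close to zero, so the Hessian term dominates:
\Delta L \;\approx\; \nabla L(w^*)^{\top} \Delta w
          \;+\; \tfrac{1}{2}\, \Delta w^{\top} H\, \Delta w
          \;\approx\; \tfrac{1}{2}\, \Delta w^{\top} H\, \Delta w
```

Two perturbations with the same norm can thus raise the loss by very different amounts depending on how they align with the Hessian's sharp directions; this is the gap that Hessian-conditioning methods target.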

Last updated Feb 28, 2026

Papers

Research Paper · Jan 14, 2026

LLMs can Compress LLMs: Adaptive Pruning by Agents

As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as Sparse...

8.0 viability
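As background on what post-training pruning does mechanically, here is a minimal magnitude-pruning sketch in PyTorch. It is a generic baseline with an illustrative `sparsity` parameter, not the agent-guided layer selection of the paper above.

```python
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float = 0.5) -> None:
    """Zero the smallest-magnitude weights in-place. A generic baseline,
    not the agent-guided method from the paper above."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)                  # how many weights to drop
    if k == 0:
        return
    thresh = w.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    w.mul_((w.abs() > thresh).to(w.dtype))

layer = torch.nn.Linear(512, 512)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # ≈ 0.5
```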
Research Paper · Jan 16, 2026

FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization

Although post-training quantization (PTQ) provides an efficient numerical compression scheme for deploying large language models (LLMs) on resource-constrained devices, the representativeness and univ...

7.0 viability
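For context on why calibration data matters in PTQ, a minimal per-tensor symmetric quantization sketch follows; the tensors and bit-width are illustrative assumptions, and this is not the family-aware regeneration scheme itself.

```python
import torch

def calibrate_scale(acts: torch.Tensor, n_bits: int = 8) -> float:
    """Symmetric per-tensor scale from calibration activations. If the
    calibration set is unrepresentative of deployment inputs, this range
    estimate is off and quantization error grows -- the failure mode that
    better calibration data targets."""
    qmax = 2 ** (n_bits - 1) - 1
    return acts.abs().max().item() / qmax

def quantize(x: torch.Tensor, scale: float, n_bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (n_bits - 1) - 1
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

calib = torch.randn(256, 768)   # stand-in for real calibration data
scale = calibrate_scale(calib)
x = torch.randn(4, 768)
print((quantize(x, scale) - x).abs().mean())  # mean quantization error
```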
Research Paper · Jan 29, 2026

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error....

5.0 viability
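The 'low error, high loss' paradox can be reproduced numerically with the second-order approximation from the overview above; the diagonal Hessian here is a toy assumption, and this demo is not HeRo-Q's method.

```python
import numpy as np

# Ill-conditioned diagonal Hessian: one sharp direction, one flat one.
H = np.diag([100.0, 0.01])

# Two quantization perturbations with IDENTICAL L2 error...
dw_sharp = np.array([0.1, 0.0])   # error falls on the sharp direction
dw_flat  = np.array([0.0, 0.1])   # same error, flat direction

loss_rise = lambda dw: 0.5 * dw @ H @ dw   # second-order loss increase
print(loss_rise(dw_sharp))  # 0.5    -> "low error, high loss"
print(loss_rise(dw_flat))   # 5e-05  -> same error, negligible loss
```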
Research Paper · Jan 21, 2026

QUAIL: Quantization Aware Unlearning for Mitigating Misinformation in LLMs

Machine unlearning aims to remove specific knowledge (e.g., copyrighted or private data) from a trained model without full retraining. In practice, models are often quantized (e.g., 4-bit) for deploym...

5.0 viability
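A heavily simplified sketch of the generic idea behind quantization-aware unlearning: perform the unlearning update through a simulated quantizer (straight-through estimator) so the edit survives low-bit deployment. The objective, model, and hyperparameters below are illustrative assumptions, not QUAIL's actual algorithm.

```python
import torch

def fake_quant(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Simulated quantization with a straight-through estimator:
    forward pass sees quantized weights, gradients flow to the originals."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (wq - w).detach()

model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
forget_x, forget_y = torch.randn(8, 16), torch.randint(0, 2, (8,))

for _ in range(10):
    # Gradient *ascent* on the forget set, evaluated on the quantized
    # weights, so the unlearning effect holds under 4-bit deployment.
    logits = torch.nn.functional.linear(
        forget_x, fake_quant(model.weight), model.bias)
    loss = -torch.nn.functional.cross_entropy(logits, forget_y)
    opt.zero_grad(); loss.backward(); opt.step()
```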
Research Paper · Feb 25, 2026

Sparsity Induction for Accurate Post-Training Pruning of Large Language Models

Large language models have demonstrated strong capabilities in text generation, but their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS)...

5.0 viability
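For background, one widely used post-training sparsity criterion from prior work scores each weight by its magnitude times the norm of the activations it multiplies (Wanda, Sun et al., 2023); the sketch below shows that baseline, not the sparsity-induction method of the paper above.

```python
import torch

def wanda_scores(weight: torch.Tensor, calib_inputs: torch.Tensor) -> torch.Tensor:
    """Score each weight by |w_ij| * ||x_j||_2 over calibration inputs, so
    weights feeding high-magnitude activations are kept (cf. Wanda,
    Sun et al., 2023). Not the method of the paper above."""
    act_norm = calib_inputs.norm(dim=0)          # per-input-feature L2 norm
    return weight.abs() * act_norm.unsqueeze(0)  # [out, in] scores

W = torch.randn(64, 128)
X = torch.randn(1000, 128)
scores = wanda_scores(W, X)
# Keep the top 50% of weights per output row:
thresh = scores.median(dim=1, keepdim=True).values
W_pruned = W * (scores > thresh)
```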
Research Paper · Feb 24, 2026

Elimination-compensation pruning for fully-connected neural networks

The unmatched ability of Deep Neural Networks to capture complex patterns in large and noisy datasets is often attributed to their large hypothesis space and, consequently, to the vast amount of pa...

5.0 viability
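The elimination-plus-compensation idea in general form: remove units, then adjust the surviving weights so the layer's output is preserved. The least-squares refit below is one illustrative choice of compensation, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))        # layer inputs (calibration samples)
W = rng.normal(size=(32, 16))          # fully-connected weights
Y = X @ W                              # original layer output to preserve

keep = np.arange(32) >= 8              # eliminate the first 8 input units
# Compensation: least-squares refit of the surviving weights so the pruned
# layer matches the original output on the calibration inputs.
W_comp, *_ = np.linalg.lstsq(X[:, keep], Y, rcond=None)

err_naive = np.abs(X[:, keep] @ W[keep] - Y).mean()   # prune only
err_comp  = np.abs(X[:, keep] @ W_comp - Y).mean()    # prune + compensate
print(err_naive, err_comp)  # compensation reduces the output error
```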
Research Paper · Feb 19, 2026

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned ...

2.0 viability
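The central claim, that weight signs fixed at random initialization largely persist through training, can be checked on a toy MLP; the model, data, and training budget below are toy assumptions, not the paper's experimental setup.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
init_signs = [p.detach().sign().clone() for p in model.parameters()]

X, y = torch.randn(512, 32), torch.randint(0, 2, (512,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = torch.nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Fraction of weights whose sign never flipped from initialization.
same = sum((p.sign() == s).sum().item()
           for p, s in zip(model.parameters(), init_signs))
total = sum(p.numel() for p in model.parameters())
print(same / total)  # most signs persist in this short toy run
```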