State of the Field
Current research in model compression increasingly focuses on making large language models (LLMs) more efficient while preserving their performance. Recent work introduces techniques such as agent-guided pruning, which selects layers to prune so that critical knowledge pathways are preserved, and family-aware quantization, which regenerates high-fidelity calibration data to reduce accuracy loss during quantization. These advances address the commercial need to deploy LLMs on resource-constrained devices without sacrificing accuracy. Other methods, such as HeRo-Q, aim to stabilize low-bit quantization by conditioning the Hessian to reshape the loss landscape, while quantization-aware unlearning tackles the challenge of removing sensitive information from models without full retraining. The field is shifting toward adaptive, intelligent strategies that not only compress models but also safeguard their reliability and integrity in real-world applications, signaling a maturation in the approach to model efficiency.
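Much of the quantization work surveyed above builds on post-training quantization (PTQ), whose simplest form is symmetric round-to-nearest. The sketch below is illustrative only and does not reproduce any specific paper's method; all names and parameters are assumptions for the example.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    # Symmetric round-to-nearest PTQ: map weights onto a uniform
    # integer grid, then dequantize back to floats.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
w_hat = quantize_rtn(w, bits=4)
mse = float(np.mean((w - w_hat) ** 2))  # per-weight quantization error
```

Methods like family-aware quantization improve on this baseline by choosing better calibration data; GPTQ-style methods additionally update remaining weights to compensate for rounding error.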
Papers
LLMs can Compress LLMs: Adaptive Pruning by Agents
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as Sparse...
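The simplest post-training pruning baseline that methods like this improve on is unstructured magnitude pruning. A minimal sketch, purely illustrative (not the agent-based approach described above):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Unstructured magnitude pruning: zero out the smallest-|w| weights
    # until the requested fraction of weights is zero.
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

w = np.random.default_rng(0).normal(size=(100, 100))
w_pruned = magnitude_prune(w, sparsity=0.5)
```

Agent-guided approaches replace the global magnitude criterion with a learned, layer-aware selection of what to prune.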
FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization
Although post-training quantization (PTQ) provides an efficient numerical compression scheme for deploying large language models (LLMs) on resource-constrained devices, the representativeness and univ...
HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error....
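The 'low error, high loss' phenomenon can be seen in a deterministic two-weight toy (values chosen for illustration, not from the paper): the quantized weights closest to the originals are not the ones that minimize the task loss.

```python
import numpy as np

x, y = np.array([1.0, 1.0]), 1.2   # one calibration sample and its target
w = np.array([0.6, 0.6])           # full-precision weights
loss = lambda wq: float((x @ wq - y) ** 2)

rtn = np.round(w)                  # round-to-nearest -> [1., 1.]
alt = np.array([1.0, 0.0])         # larger weight error, but compensating

err_rtn = float(np.sum((w - rtn) ** 2))  # 0.32
err_alt = float(np.sum((w - alt) ** 2))  # 0.52
loss_rtn, loss_alt = loss(rtn), loss(alt)  # 0.64 vs 0.04
# rtn minimizes quantization error yet yields 16x higher loss.
```

This is the kind of mismatch that loss-aware approaches (e.g., Hessian conditioning) are designed to correct.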
QUAIL: Quantization Aware Unlearning for Mitigating Misinformation in LLMs
Machine unlearning aims to remove specific knowledge (e.g., copyrighted or private data) from a trained model without full retraining. In practice, models are often quantized (e.g., 4-bit) for deploym...
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS)...
Elimination-compensation pruning for fully-connected neural networks
The unmatched ability of Deep Neural Networks in capturing complex patterns in large and noisy datasets is often associated with their large hypothesis space, and consequently with the vast amount of pa...
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned ...
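A back-of-envelope calculation shows why sign bits become the bottleneck: one raw sign bit per weight is a fixed cost, so at any sub-bit budget the signs alone overrun it. The model size and budget below are assumptions for illustration.

```python
# At a 0.5-bit/weight budget, raw sign storage alone is already
# twice the entire budget, before storing any magnitude information.
n_weights = 7_000_000_000            # e.g. a 7B-parameter model (assumption)
budget_bits_per_weight = 0.5         # sub-bit compression target (assumption)

sign_bits = n_weights * 1            # 1 raw sign bit per weight
total_budget = n_weights * budget_bits_per_weight
fraction = sign_bits / total_budget  # signs as a multiple of the budget
```

Hence sub-bit schemes must compress or fix the signs themselves, which is exactly where persistent, randomly initialized signs become a problem.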