State of the Field
Recent research in model optimization focuses on improving the efficiency and performance of large neural networks, addressing critical challenges in deployment and resource management. Stage-aware strategies such as Prefill-Only Pruning reduce computational cost during inference without sacrificing accuracy, while methods like FlashOptim cut memory usage during training, making larger models feasible on limited hardware. Adaptive frameworks such as Routing the Lottery discover specialized subnetworks tailored to heterogeneous inputs, improving performance while reducing parameter counts. Innovations in quantization, exemplified by Quant Experts, refine how models manage memory and computational overhead, keeping large vision-language models effective under tight constraints. Collectively, these advances point toward more modular, efficient, and context-aware deep learning architectures suited to real-world deployment.
Papers
POP: Prefill-Only Pruning for Efficient Large Model Inference
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured ...
Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity
Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In thi...
When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging
Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving con...
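The additive merging scheme this abstract describes can be sketched in a few lines. This is a minimal illustration of merging by summing weight updates (often called task arithmetic), not the paper's specific method; the `alpha` scaling knob is an assumption added for illustration.

```python
import numpy as np

def merge_by_task_arithmetic(base, finetuned, alpha=1.0):
    """Merge fine-tuned models by adding their weight updates (the
    deltas from the shared base) back onto the base weights.
    `alpha` is an assumed knob scaling the summed updates."""
    updates = [ft - base for ft in finetuned]
    return base + alpha * sum(updates)

# Toy example: two fine-tuned variants of a 4-weight "model".
base = np.zeros(4)
ft_a = np.array([1.0, 0.0, 0.0, 0.0])  # update from task A
ft_b = np.array([0.0, 2.0, 0.0, 0.0])  # update from task B
merged = merge_by_task_arithmetic(base, [ft_a, ft_b])
print(merged)  # both updates accumulate in the merged weights
```

When the updates touch overlapping directions, they interfere rather than accumulate cleanly, which is the failure mode the paper's title alludes to.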
Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data
In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterpar...
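The simplest way to extract a candidate "winning ticket" mask is one-shot magnitude pruning over a trained layer. The sketch below shows that baseline criterion only; it is not the adaptive routing scheme this paper proposes.

```python
import numpy as np

def magnitude_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Boolean mask keeping the largest-magnitude (1 - sparsity)
    fraction of weights; the classic lottery-ticket baseline."""
    k = int(weights.size * (1.0 - sparsity))  # weights to keep
    if k == 0:
        return np.zeros(weights.shape, dtype=bool)
    # k-th largest absolute value becomes the keep threshold
    threshold = np.partition(np.abs(weights).ravel(), -k)[-k]
    return np.abs(weights) >= threshold

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
mask = magnitude_mask(w, sparsity=0.9)
print(mask.mean())  # roughly 0.1 of the weights survive
```

A sparse subnetwork is then trained with `w * mask`, freezing the pruned entries at zero.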
FlashOptim: Optimizers for Memory Efficient Training
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and on...
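The per-parameter byte cost the abstract alludes to can be made concrete with the standard accounting for Adam under fp16/fp32 mixed precision. The figures below are the textbook breakdown, not FlashOptim's specific scheme.

```python
def adam_mixed_precision_bytes_per_param() -> int:
    """Standard accounting for mixed-precision Adam state; illustrative
    of the baseline cost, not FlashOptim's optimized layout."""
    fp16_param = 2    # half-precision weight used in forward/backward
    fp16_grad = 2     # half-precision gradient
    fp32_master = 4   # full-precision master copy of the weight
    fp32_moment1 = 4  # Adam first moment (m)
    fp32_moment2 = 4  # Adam second moment (v)
    return fp16_param + fp16_grad + fp32_master + fp32_moment1 + fp32_moment2

params = 7e9  # e.g. a 7B-parameter model
gib = adam_mixed_precision_bytes_per_param() * params / 2**30
print(f"{gib:.0f} GiB of parameter-associated state")
```

At 16 bytes per parameter, optimizer state alone dwarfs the model weights, which is why memory-efficient optimizers target exactly these buffers.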
Performance and Complexity Trade-off Optimization of Speech Models During Training
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metr...
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights a...
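The baseline operation that PTQ error-reconstruction methods improve on is round-to-nearest quantization. Below is a minimal symmetric per-tensor int8 sketch of that baseline; it is not the token-aware mixture-of-experts scheme this paper proposes.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 round-to-nearest PTQ.
    Returns int8 codes and the float scale needed to decode them."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # worst-case per-weight error
print(err <= s / 2 + 1e-6)  # rounding error is bounded by half a step
```

Each weight now costs 1 byte instead of 4, at the price of a per-weight error of at most half a quantization step; adaptive reconstruction methods aim to shrink the error this simple scheme leaves behind.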
SageBwd: A Trainable Low-bit Attention
Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduc...
Sink-Aware Pruning for Diffusion Language Models
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typica...
EUGens: Efficient, Unified, and General Dense Layers
Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computa...
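The cost a fully-connected feedforward layer introduces can be counted directly. The sketch below is textbook parameter/multiply-add accounting to make that cost concrete; it is not a claim about EUGens itself.

```python
def ffl_cost(d_in: int, d_out: int, bias: bool = True):
    """Parameter count and multiply-adds per token for one
    fully-connected layer y = W x + b: both grow as d_in * d_out."""
    params = d_in * d_out + (d_out if bias else 0)
    macs_per_token = d_in * d_out  # one multiply-add per weight
    return params, macs_per_token

print(ffl_cost(4096, 4096))  # (16781312, 16777216)
```

A single 4096-to-4096 layer already carries ~16.8M parameters and as many multiply-adds per token, which is the quadratic growth that efficient dense-layer designs try to break.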