State of the Field
Recent research on improving the efficiency of large language models (LLMs) focuses on reducing computational cost while maintaining performance. Techniques such as adaptive model selection and confidence-guided self-refinement are gaining traction, allowing systems to dynamically choose the most suitable model for a given task and substantially cut inference costs. Innovations like the Collaborative Memory Transformer address the challenges of long-context processing by enabling constant memory usage and linear time complexity, making LLMs more scalable. Hybrid architectures that combine sparse and linear attention mechanisms are also emerging, achieving high fidelity in long-context modeling while improving efficiency. Novel quantization frameworks, such as residual-aware binarization training, are likewise pushing the boundaries of low-bit efficiency without sacrificing accuracy. Together, these advances make LLMs more practical to deploy in commercial applications and pave the way for more sustainable AI systems that handle complex tasks with reduced energy consumption.
Papers
AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises...
CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute
Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a...
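The idea of replacing a fixed parallel-sampling budget with confidence-guided compute can be sketched as follows. This is a minimal illustration, not CoRefine's actual method: `sample_fn` is a hypothetical stand-in for one LLM decoding call, and the confidence proxy here is simply the empirical frequency of the majority answer across samples drawn so far.

```python
from collections import Counter

def adaptive_best_answer(sample_fn, max_samples=16, batch=4, conf_threshold=0.75):
    """Draw answers in small batches and stop early once the majority
    answer's empirical frequency (a crude confidence proxy) crosses a
    threshold, instead of always spending the full sampling budget.
    Hypothetical sketch; `sample_fn` stands in for one LLM decode."""
    answers = []
    while len(answers) < max_samples:
        answers.extend(sample_fn() for _ in range(batch))
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= conf_threshold:
            # Early exit: confident enough, remaining samples are saved.
            return top, len(answers)
    # Budget exhausted: fall back to the plain majority vote.
    return Counter(answers).most_common(1)[0][0], len(answers)
```

On easy queries where samples agree quickly, this stops after the first batch; hard, high-disagreement queries still receive the full budget.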
CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory...
Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a sy...
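A cascade of this kind is easy to sketch. The snippet below is an illustrative routing skeleton under assumed interfaces, not COREA's implementation: `slm`, `llm`, and `confidence` are placeholder callables, and the threshold would in practice be set from a calibration set.

```python
def cascade_answer(slm, llm, confidence, prompt, threshold=0.8):
    """Answer with the small model first; escalate to the large model
    only when the small model's calibrated confidence on its draft is
    below the threshold. All callables are hypothetical placeholders."""
    draft = slm(prompt)
    if confidence(draft) >= threshold:
        return draft, "slm"   # cheap path: small model is trusted
    return llm(prompt), "llm" # expensive path: escalate
```

The cost saving comes from the fraction of queries the small model handles alone; calibration matters because an overconfident small model silently trades accuracy for cost.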
MAR: Efficient Large Language Models via Module-aware Architecture Refinement
Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we prop...
Learning Generative Selection for Best-of-N
Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address t...
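For reference, classic Best-of-N selection reduces to sampling N candidates and keeping the one a scorer ranks highest; generative selection methods replace the pointwise `score` with a model that reads all candidates jointly. A minimal sketch with placeholder `generate` and `score` callables:

```python
def best_of_n(generate, score, prompt, n=8):
    """Classic Best-of-N: draw n candidates for a prompt and keep the
    one a scoring function (e.g. a reward model) ranks highest.
    `generate` and `score` are hypothetical placeholders."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Selection quality is the bottleneck this line of work targets: with a weak `score`, extra samples add cost without adding accuracy.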
Residual Context Diffusion Language Models
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art ...
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While exi...
Do LLMs Benefit From Their Own Words?
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large ...
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-frie...
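The residual binarization that RaBiT builds on can be sketched post hoc: each level approximates the remaining residual with a scaled sign matrix, so W ≈ Σ_k α_k·sign(R_k) with α_k chosen as the mean absolute residual (the least-squares-optimal scale for a sign matrix). This sketch is illustrative only; RaBiT's contribution is a training procedure, not this post-hoc decomposition.

```python
import numpy as np

def residual_binarize(W, levels=2):
    """Greedy residual binarization: at each level, binarize the current
    residual as alpha * sign(R), where alpha = mean(|R|) minimizes the
    Frobenius error for a fixed sign pattern, then subtract it off.
    Illustrative sketch of the scheme RaBiT trains against."""
    residual = W.astype(np.float64).copy()
    terms = []
    for _ in range(levels):
        B = np.sign(residual)
        B[B == 0] = 1.0                      # avoid zero signs
        alpha = np.mean(np.abs(residual))    # optimal scale for sign(R)
        terms.append((alpha, B))
        residual = residual - alpha * B      # pass the residual down
    approx = sum(a * B for a, B in terms)
    return approx, terms
```

Because each level's scale is the error-minimizing choice for its sign pattern, adding a second level can only shrink (never grow) the reconstruction error, which is the basic appeal of residual over plain binarization.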