LLM Efficiency

Trending
11 papers · 5.8 viability · +75% (30d)

State of the Field

Recent research on large language model (LLM) efficiency focuses on reducing computational cost while preserving performance. Techniques such as adaptive model selection and confidence-guided self-refinement are gaining traction, letting systems dynamically choose the most suitable model, or the right amount of test-time compute, for a given task and thereby cut inference costs. Innovations like the Collaborative Memory Transformer address long-context processing with constant memory usage and linear time complexity, making LLMs more scalable. Hybrid architectures that combine sparse and linear attention mechanisms are also emerging, achieving high fidelity in long-context modeling while improving efficiency. Novel quantization frameworks, such as residual-aware binarization training, push low-bit efficiency further without sacrificing accuracy. Together, these advances make LLMs more practical to deploy in commercial applications and pave the way for more sustainable AI systems that handle complex tasks with lower energy consumption.

Last updated Feb 20, 2026
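The adaptive model selection theme above can be sketched as a confidence-gated router: try a cheap model first and escalate only when its confidence is low. The model stubs, toy confidence heuristic, and threshold below are illustrative assumptions, not taken from any of the listed papers.

```python
# Hypothetical sketch of confidence-gated model routing: answer with a small
# model first, and escalate to a large model only when confidence is low.
# Both "models" are stubs; a real system would call actual LLM endpoints.

def small_model(prompt: str) -> tuple[str, float]:
    """Stub: returns (answer, confidence in [0, 1])."""
    conf = 0.9 if len(prompt) < 40 else 0.3  # toy heuristic, not a real score
    return f"small-answer:{prompt}", conf

def large_model(prompt: str) -> str:
    """Stub: the expensive fallback model."""
    return f"large-answer:{prompt}"

def route(prompt: str, threshold: float = 0.7) -> tuple[str, str]:
    """Return (answer, model_used), escalating when confidence < threshold."""
    answer, conf = small_model(prompt)
    if conf >= threshold:
        return answer, "small"
    return large_model(prompt), "large"

print(route("short question"))
print(route("a much longer, harder question needing escalation"))
```

The design point is that the large model is only invoked on the fraction of queries where the cheap model is unsure, so average cost tracks the small model's cost.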

Papers

1–10 of 11
Research Paper·Feb 12, 2026

AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises...

8.0 viability
Research Paper·Feb 9, 2026

CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute

Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a...

8.0 viability
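The adaptive test-time compute idea in the CoRefine abstract can be illustrated with a simple loop: draw a small batch of samples, use self-agreement as a confidence proxy, and spend more compute only when agreement is low. The stub sampler, doubling schedule, and agreement threshold are all assumptions for illustration, not CoRefine's actual procedure.

```python
import collections
import random

# Hypothetical sketch of confidence-guided test-time compute: start with a
# small sample pool and grow it only while self-agreement (a stand-in for a
# learned confidence signal) stays below a threshold.

def sample_answer(rng: random.Random) -> str:
    # Stub: a biased sampler standing in for one LLM decoding pass.
    return rng.choices(["42", "41", "43"], weights=[0.8, 0.1, 0.1])[0]

def adaptive_majority(rng, min_samples=4, max_samples=64, agree_threshold=0.75):
    """Grow the pool until the top answer's vote share clears the threshold."""
    samples = [sample_answer(rng) for _ in range(min_samples)]
    while True:
        top, votes = collections.Counter(samples).most_common(1)[0]
        if votes / len(samples) >= agree_threshold or len(samples) >= max_samples:
            return top, len(samples)
        # Low agreement: double the sampling budget and re-check.
        samples.extend(sample_answer(rng) for _ in range(len(samples)))

answer, used = adaptive_majority(random.Random(0))
print(answer, used)
```

Compared with always decoding a fixed 512 samples, easy queries exit after the minimum batch and only ambiguous ones consume the full budget.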
Research Paper·Feb 2, 2026

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling

The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory...

8.0 viability
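The constant-memory property claimed for CoMeT can be demonstrated with a toy bounded KV cache: once a fixed capacity is reached, old entries are folded into a summary slot so total storage never grows. The mean-pooling compression here is purely an illustrative assumption, not CoMeT's actual mechanism.

```python
# Toy bounded key-value cache: total storage stays constant regardless of
# sequence length, because old entries are compressed into a summary slot.
# (Mean pooling is an illustrative placeholder, not CoMeT's mechanism.)

class BoundedKVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: list[list[float]] = []

    def append(self, vec: list[float]) -> None:
        self.entries.append(vec)
        if len(self.entries) > self.capacity:
            # Fold the two oldest entries into one averaged summary slot.
            a, b = self.entries[0], self.entries[1]
            merged = [(x + y) / 2 for x, y in zip(a, b)]
            self.entries = [merged] + self.entries[2:]

cache = BoundedKVCache(capacity=4)
for t in range(10):
    cache.append([float(t)])
print(len(cache.entries))  # → 4
```

Each append is O(capacity) work, so processing a length-n sequence is linear in n with O(1) memory, which is the scaling regime the abstract describes.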
Research Paper·Mar 4, 2026

Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning

Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a sy...

7.0 viability
Research Paper·Jan 29, 2026

MAR: Efficient Large Language Models via Module-aware Architecture Refinement

Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we prop...

7.0 viability
Research Paper·Feb 2, 2026

Learning Generative Selection for Best-of-N

Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address t...

7.0 viability
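At its core, Best-of-N selection is just scoring N sampled candidates and keeping the argmax; the selection quality the abstract mentions is entirely determined by the scorer. The length-based reward stub below is an assumption standing in for a learned or generative selector.

```python
# Minimal Best-of-N sketch: generate N candidates, score each, return the
# argmax. The scorer is a stub; GenSelect-style methods would replace it
# with a generative selection model.

def score(candidate: str) -> float:
    # Stub reward: prefer longer, more detailed candidates (an assumption).
    return float(len(candidate))

def best_of_n(candidates: list[str]) -> str:
    return max(candidates, key=score)

cands = ["short", "a medium answer", "the longest, most detailed answer"]
print(best_of_n(cands))  # → "the longest, most detailed answer"
```

A weak scorer caps the benefit of drawing more samples, which is why work in this area focuses on learning the selector rather than simply raising N.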
Research Paper·Jan 30, 2026

Residual Context Diffusion Language Models

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art ...

6.0 viability
Research Paper·Feb 12, 2026

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While exi...

5.0 viability
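One naive way to hybridize sparse and linear attention, as in the MiniCPM-SALA title, is to give each query exact softmax attention over a short sliding window and route everything older through a constant-size linear-attention summary. The feature map, window size, and score-mixing scheme below are assumptions for illustration; MiniCPM-SALA's actual architecture differs.

```python
import numpy as np

# Naive hybrid attention sketch: exact (softmax) scores over the last few
# keys, plus a constant-size linear-attention summary of all older keys.
# Illustration only; not MiniCPM-SALA's actual design.

def phi(x: np.ndarray) -> np.ndarray:
    # Positive feature map for the linear branch (an assumed choice).
    return np.maximum(x, 0.0) + 1e-6

def hybrid_attention(q, k, v, window: int = 4) -> np.ndarray:
    n, d = q.shape
    out = np.zeros_like(v)
    S = np.zeros((d, v.shape[1]))  # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d)                # running sum of phi(k_j)
    folded = 0                     # number of keys absorbed into (S, z)
    for i in range(n):
        lo = max(0, i - window + 1)
        while folded < lo:         # keys leaving the window join the summary
            f = phi(k[folded])
            S += np.outer(f, v[folded])
            z += f
            folded += 1
        near = np.exp(q[i] @ k[lo:i + 1].T / np.sqrt(d))  # exact local scores
        num = near @ v[lo:i + 1] + phi(q[i]) @ S
        den = near.sum() + phi(q[i]) @ z
        out[i] = num / den
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
out = hybrid_attention(q, k, np.ones((8, 3)))
print(out.shape)  # → (8, 3)
```

The efficiency point is that the far-context state `(S, z)` has fixed size, so memory stays O(d · d_v) no matter how long the sequence grows, while the window term preserves exact attention where it matters most.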
Research Paper·Feb 27, 2026

Do LLMs Benefit From Their Own Words?

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large ...

3.0 viability
Research Paper·Feb 5, 2026·B2B·Consumer

RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-frie...

3.0 viability
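The residual binarization named in the RaBiT title approximates a weight tensor as a sum of scaled sign matrices, each fitted to the residual left by the previous level. The greedy per-tensor scheme below is a generic illustration of that idea, not RaBiT's training procedure.

```python
import numpy as np

# Greedy residual binarization: approximate W by a sum of alpha * sign
# matrices, each level fitted to the residual the previous level leaves.
# (Generic illustration; RaBiT's actual training is more involved.)

def residual_binarize(w: np.ndarray, levels: int = 2):
    residual = w.copy()
    terms = []
    for _ in range(levels):
        b = np.sign(residual)
        b[b == 0] = 1.0
        alpha = np.mean(np.abs(residual))  # L2-optimal per-tensor scale
        terms.append((alpha, b))
        residual = residual - alpha * b
    return terms

def reconstruct(terms):
    return sum(alpha * b for alpha, b in terms)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
for k in (1, 2, 3):
    approx = reconstruct(residual_binarize(w, levels=k))
    err = np.linalg.norm(w - approx) / np.linalg.norm(w)
    print(k, round(float(err), 3))  # error shrinks as levels are added
```

Each level stores only a sign matrix (1 bit per weight) plus one scalar, which is what makes the format hardware-friendly; the accuracy question the abstract raises is how few levels suffice.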