State of the Field
Recent advances in large language model (LLM) optimization focus on efficiency and adaptability in enterprise applications, addressing scalability and resource constraints. Automated frameworks like OptiKIT streamline model optimization, enabling non-expert teams to achieve significant gains in GPU utilization and performance without deep tuning expertise. Causal prompt optimization is reshaping how prompts are designed, producing responses tailored to specific queries while reducing inference cost and improving robustness. ALTER tackles unlearning, ensuring models can forget unwanted information without sacrificing utility, while HeteroCache addresses KV-cache memory management for long-context inference. PROTEUS introduces routing mechanisms that align model performance with operational targets, and FusionRoute's token-level collaboration strategies optimize multi-LLM interactions. Collectively, these developments address pressing commercial challenges, making LLMs more efficient and responsive to diverse enterprise needs.
Papers
1–10 of 37
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expe...
ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. H...
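ALTER builds on LoRA-style low-rank adapters. As background only, a minimal numpy sketch of a plain LoRA forward pass; the asymmetric, token-entropy-guided machinery that ALTER adds is not shown, and all names here are illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=2.0):
    """LoRA adapter forward: y = x @ W + (alpha / r) * (x @ A) @ B, with the
    base weight W frozen and only the low-rank factors A, B trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
W = rng.normal(size=(d_in, d_out))     # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01  # down-projection, small random init
B = np.zeros((r, d_out))               # up-projection, zero init

x = rng.normal(size=(4, d_in))
# With B initialized to zero, the adapter is a no-op at the start of training.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

The zero-initialized up-projection is the standard LoRA trick that lets training start exactly at the frozen model; unlearning methods then steer the low-rank delta instead of the full weights.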
LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mech...
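The general shape of such a recurrence, an explicit memory state that is rewritten after every prediction so later steps can see earlier feedback, can be sketched as follows. Every name here is hypothetical; this is not the paper's API:

```python
def run_with_memory(step_fn, update_fn, inputs, memory=()):
    """Toy recurrent loop: each step reads an updatable memory state, and the
    memory is rewritten after every prediction (names are hypothetical)."""
    outputs = []
    for x in inputs:
        y = step_fn(memory, x)            # predict conditioned on the memory
        memory = update_fn(memory, x, y)  # fold the outcome back into memory
        outputs.append(y)
    return outputs, memory

# Toy instance: predict the running sum so far, storing (count, total).
step = lambda mem, x: (mem[1] if mem else 0) + x
update = lambda mem, x, y: ((mem[0] if mem else 0) + 1, y)
outs, mem = run_with_memory(step, update, [1, 2, 3])
print(outs, mem)  # → [1, 3, 6] (3, 6)
```

The contrast with standard inference is that `memory` is mutable state, whereas an immutable context history can only grow.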
Optimizing Prompts for Large Language Models: A Causal Approach
Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate t...
Token-Level LLM Collaboration via FusionRoute
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to size...
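Token-level collaboration can be pictured as a per-token routing loop over several candidate models. A toy sketch, substituting a greedy confidence rule for FusionRoute's learned router (all names hypothetical):

```python
def fused_decode(models, prompt_tokens, steps):
    """Toy token-level routing: at each step every candidate model proposes a
    next token with a confidence score, and the most confident proposal is
    appended. A real router would be a trained model, not this greedy rule."""
    seq = list(prompt_tokens)
    for _ in range(steps):
        proposals = [m(seq) for m in models]  # [(token, confidence), ...]
        token, _ = max(proposals, key=lambda p: p[1])
        seq.append(token)
    return seq

# Two toy "experts": one confident on even positions, one on odd positions.
m_even = lambda seq: ("E", 0.9 if len(seq) % 2 == 0 else 0.1)
m_odd = lambda seq: ("O", 0.9 if len(seq) % 2 == 1 else 0.1)
print(fused_decode([m_even, m_odd], [], 4))  # → ['E', 'O', 'E', 'O']
```

The point of routing at token granularity is visible even in the toy: the fused sequence interleaves both experts' strengths within a single generation.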
PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems
Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LL...
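A Lagrangian treatment of SLA-aware routing can be sketched as constrained routing with dual ascent on a price variable: route to maximize quality minus a price times cost, and raise the price whenever cost overshoots the budget. This is a toy illustration of the general idea, not PROTEUS's actual algorithm; the arms, update rule, and constants are all assumptions:

```python
def lagrangian_route(arms, lam):
    """Pick the arm maximizing quality minus lam times cost."""
    return max(range(len(arms)), key=lambda i: arms[i][0] - lam * arms[i][1])

def dual_update(lam, observed_cost, budget, lr=0.1):
    """Dual ascent: raise the price lam when cost overshoots the budget,
    lower it (never below zero) when there is slack."""
    return max(0.0, lam + lr * (observed_cost - budget))

# Two arms as (quality, cost); the budget caps average cost per request.
arms, budget, lam = [(0.9, 1.0), (0.6, 0.2)], 0.5, 0.0
picks = []
for _ in range(50):
    i = lagrangian_route(arms, lam)
    picks.append(i)
    lam = dual_update(lam, arms[i][1], budget)

# The price rises until traffic starts mixing onto the cheaper arm.
assert 0 in picks and 1 in picks
```

The oscillation between arms is expected dual behavior: at the equilibrium price the router mixes traffic so that average cost meets the budget.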
LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of...
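The underlying idea, training a probe on internal activations to predict whether generation will succeed, can be sketched on synthetic data. The probe form and the data here are assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400
w_true = rng.normal(size=d)         # hidden "success direction" (synthetic)
H = rng.normal(size=(n, d))         # stand-in pre-generation activations
y = (H @ w_true > 0).astype(float)  # 1 = the model would succeed

# Fit a linear probe by least squares on centered labels; a real setup would
# train on logged successes/failures from actual model runs.
w = np.linalg.lstsq(H, y - 0.5, rcond=None)[0]
acc = ((H @ w > 0).astype(float) == y).mean()
print(f"probe accuracy: {acc:.2f}")  # well above the 0.5 chance rate
```

If such a probe works on real activations, it can gate extended reasoning: spend extra compute only on inputs the probe flags as likely failures.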
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important informati...
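The static, score-based compression that the abstract contrasts against can be sketched as eviction by importance score plus a recency window; HeteroCache's dynamic retrieval and per-layer heterogeneity go beyond this toy:

```python
def compress_kv(cache, scores, budget, recent=2):
    """Toy KV-cache eviction: always keep the `recent` newest entries, then
    fill the remaining budget with the highest-scoring older entries."""
    n = len(cache)
    keep = set(range(max(0, n - recent), n))  # sliding recent window
    older = sorted(range(max(0, n - recent)),
                   key=lambda i: scores[i], reverse=True)
    for i in older:
        if len(keep) >= budget:
            break
        keep.add(i)
    return [cache[i] for i in sorted(keep)]

cache = ["k0", "k1", "k2", "k3", "k4", "k5"]
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4]
print(compress_kv(cache, scores, budget=4))  # → ['k0', 'k2', 'k4', 'k5']
```

The failure mode motivating the paper is visible here: a fixed score snapshot decides evictions once, so an entry that becomes important only for a later query cannot be brought back.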
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution sp...
ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference
The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for examp...
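The contrast with uniform brute-force sampling can be sketched as a per-query sample budget that scales with estimated difficulty. ODAR's active-inference criterion is far more principled; this threshold rule and its names are stand-ins:

```python
def adaptive_budget(difficulty, min_k=1, max_k=8):
    """Toy adaptive test-time compute: map an estimated difficulty in [0, 1]
    to a sample count, so easy queries get one sample and hard ones get many,
    instead of sampling a fixed k for every query."""
    k = min_k + round(difficulty * (max_k - min_k))
    return max(min_k, min(max_k, k))

print([adaptive_budget(d) for d in (0.0, 0.5, 1.0)])  # → [1, 5, 8]
```

Under such a policy, average compute tracks the difficulty distribution of the workload rather than its worst case.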