State of the Field
Recent advances in large language model (LLM) optimization focus on efficiency and adaptability in enterprise applications, addressing scalability and resource constraints. Automated frameworks like OptiKIT streamline model optimization, enabling non-expert teams to achieve significant gains in GPU utilization and performance without deep tuning expertise. Causal prompt optimization is reshaping how prompts are designed, producing responses tailored to specific queries while reducing inference cost and improving robustness. ALTER tackles unlearning, ensuring models can forget unwanted information without sacrificing utility, while HeteroCache addresses KV-cache memory management for long-context inference. PROTEUS introduces routing mechanisms that align model performance with operational targets, and FusionRoute's token-level collaboration strategies optimize multi-LLM interactions. Collectively, these developments address pressing commercial challenges, making LLMs more efficient and responsive to diverse enterprise needs.
Papers
1–10 of 37
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expe...
ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. H...
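ALTER builds on LoRA-style low-rank adapters. As background only, a minimal numpy sketch of a plain LoRA forward pass; the asymmetric, token-entropy-guided machinery that ALTER adds is not shown, and all names here are illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=2.0):
    """LoRA adapter forward: y = x @ W + (alpha / r) * (x @ A) @ B, with the
    base weight W frozen and only the low-rank factors A, B trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
W = rng.normal(size=(d_in, d_out))     # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01  # down-projection, small random init
B = np.zeros((r, d_out))               # up-projection, zero init

x = rng.normal(size=(4, d_in))
# With B initialized to zero, the adapter is a no-op at the start of training.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

The zero-initialized up-projection is the standard LoRA trick that lets training start exactly at the frozen model; unlearning methods then steer the low-rank delta instead of the full weights.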
LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mech...
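The general shape of such a recurrence, an explicit memory state that is rewritten after every prediction so later steps can see earlier feedback, can be sketched as follows. Every name here is hypothetical; this is not the paper's API:

```python
def run_with_memory(step_fn, update_fn, inputs, memory=()):
    """Toy recurrent loop: each step reads an updatable memory state, and the
    memory is rewritten after every prediction (names are hypothetical)."""
    outputs = []
    for x in inputs:
        y = step_fn(memory, x)            # predict conditioned on the memory
        memory = update_fn(memory, x, y)  # fold the outcome back into memory
        outputs.append(y)
    return outputs, memory

# Toy instance: predict the running sum so far, storing (count, total).
step = lambda mem, x: (mem[1] if mem else 0) + x
update = lambda mem, x, y: ((mem[0] if mem else 0) + 1, y)
outs, mem = run_with_memory(step, update, [1, 2, 3])
print(outs, mem)  # → [1, 3, 6] (3, 6)
```

The contrast with standard inference is that `memory` is mutable state, whereas an immutable context history can only grow.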
Optimizing Prompts for Large Language Models: A Causal Approach
Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate t...
Token-Level LLM Collaboration via FusionRoute
Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to size...
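Token-level collaboration can be pictured as a per-token routing loop over several candidate models. A toy sketch, substituting a greedy confidence rule for FusionRoute's learned router (all names hypothetical):

```python
def fused_decode(models, prompt_tokens, steps):
    """Toy token-level routing: at each step every candidate model proposes a
    next token with a confidence score, and the most confident proposal is
    appended. A real router would be a trained model, not this greedy rule."""
    seq = list(prompt_tokens)
    for _ in range(steps):
        proposals = [m(seq) for m in models]  # [(token, confidence), ...]
        token, _ = max(proposals, key=lambda p: p[1])
        seq.append(token)
    return seq

# Two toy "experts": one confident on even positions, one on odd positions.
m_even = lambda seq: ("E", 0.9 if len(seq) % 2 == 0 else 0.1)
m_odd = lambda seq: ("O", 0.9 if len(seq) % 2 == 1 else 0.1)
print(fused_decode([m_even, m_odd], [], 4))  # → ['E', 'O', 'E', 'O']
```

The point of routing at token granularity is visible even in the toy: the fused sequence interleaves both experts' strengths within a single generation.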
PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems
Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LL...
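A Lagrangian treatment of SLA-aware routing can be sketched as constrained routing with dual ascent on a price variable: route to maximize quality minus a price times cost, and raise the price whenever cost overshoots the budget. This is a toy illustration of the general idea, not PROTEUS's actual algorithm; the arms, update rule, and constants are all assumptions:

```python
def lagrangian_route(arms, lam):
    """Pick the arm maximizing quality minus lam times cost."""
    return max(range(len(arms)), key=lambda i: arms[i][0] - lam * arms[i][1])

def dual_update(lam, observed_cost, budget, lr=0.1):
    """Dual ascent: raise the price lam when cost overshoots the budget,
    lower it (never below zero) when there is slack."""
    return max(0.0, lam + lr * (observed_cost - budget))

# Two arms as (quality, cost); the budget caps average cost per request.
arms, budget, lam = [(0.9, 1.0), (0.6, 0.2)], 0.5, 0.0
picks = []
for _ in range(50):
    i = lagrangian_route(arms, lam)
    picks.append(i)
    lam = dual_update(lam, arms[i][1], budget)

# The price rises until traffic starts mixing onto the cheaper arm.
assert 0 in picks and 1 in picks
```

The oscillation between arms is expected dual behavior: at the equilibrium price the router mixes traffic so that average cost meets the budget.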
LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of...
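The underlying idea, training a probe on internal activations to predict whether generation will succeed, can be sketched on synthetic data. The probe form and the data here are assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400
w_true = rng.normal(size=d)         # hidden "success direction" (synthetic)
H = rng.normal(size=(n, d))         # stand-in pre-generation activations
y = (H @ w_true > 0).astype(float)  # 1 = the model would succeed

# Fit a linear probe by least squares on centered labels; a real setup would
# train on logged successes/failures from actual model runs.
w = np.linalg.lstsq(H, y - 0.5, rcond=None)[0]
acc = ((H @ w > 0).astype(float) == y).mean()
print(f"probe accuracy: {acc:.2f}")  # well above the 0.5 chance rate
```

If such a probe works on real activations, it can gate extended reasoning: spend extra compute only on inputs the probe flags as likely failures.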
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important informati...
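The static, score-based compression that the abstract contrasts against can be sketched as eviction by importance score plus a recency window; HeteroCache's dynamic retrieval and per-layer heterogeneity go beyond this toy:

```python
def compress_kv(cache, scores, budget, recent=2):
    """Toy KV-cache eviction: always keep the `recent` newest entries, then
    fill the remaining budget with the highest-scoring older entries."""
    n = len(cache)
    keep = set(range(max(0, n - recent), n))  # sliding recent window
    older = sorted(range(max(0, n - recent)),
                   key=lambda i: scores[i], reverse=True)
    for i in older:
        if len(keep) >= budget:
            break
        keep.add(i)
    return [cache[i] for i in sorted(keep)]

cache = ["k0", "k1", "k2", "k3", "k4", "k5"]
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4]
print(compress_kv(cache, scores, budget=4))  # → ['k0', 'k2', 'k4', 'k5']
```

The failure mode motivating the paper is visible here: a fixed score snapshot decides evictions once, so an entry that becomes important only for a later query cannot be brought back.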
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution sp...
ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference
The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for examp...
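The contrast with uniform brute-force sampling can be sketched as a per-query sample budget that scales with estimated difficulty. ODAR's active-inference criterion is far more principled; this threshold rule and its names are stand-ins:

```python
def adaptive_budget(difficulty, min_k=1, max_k=8):
    """Toy adaptive test-time compute: map an estimated difficulty in [0, 1]
    to a sample count, so easy queries get one sample and hard ones get many,
    instead of sampling a fixed k for every query."""
    k = min_k + round(difficulty * (max_k - min_k))
    return max(min_k, min(max_k, k))

print([adaptive_budget(d) for d in (0.0, 0.5, 1.0)])  # → [1, 5, 8]
```

Under such a policy, average compute tracks the difficulty distribution of the workload rather than its worst case.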