LLM Inference Optimization Comparison Hub
6 papers - avg viability 5.8
Recent work on large language model (LLM) inference optimization focuses on improving speed and efficiency without sacrificing output quality. Speculative decoding is being refined: frameworks like ConFu let draft models anticipate future token generation, improving acceptance rates and generation speed. Long-context models are another target: methods like LycheeDecode and LycheeCluster tackle memory and latency costs with sparse attention mechanisms and hierarchical cache management, achieving significant speedups without quality loss. The key-value (KV) cache is also being repurposed beyond mere acceleration, enabling adaptive reasoning and efficient sampling in downstream tasks. Collectively, these developments address the commercial challenges of computational cost and response time, making LLMs more practical for real-world applications while paving the way for more intelligent, context-aware systems.
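To make the speculative-decoding idea mentioned above concrete, here is a minimal sketch of the draft-then-verify loop, not taken from ConFu or any paper listed below. It uses greedy verification with hypothetical toy "models" over integer tokens; real systems verify against token probabilities, but the structural invariant is the same: the output always matches what the target model alone would have generated.

```python
def speculative_decode(target_step, draft_step, prefix, k=4, max_new=8):
    """Toy speculative-decoding loop (greedy verification).

    draft_step cheaply proposes k tokens; target_step verifies each one.
    Accepted proposals are kept; on the first mismatch the target's own
    token is substituted and the rest of the draft is discarded.
    """
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_step(tokens + draft))
        # Verify phase: the target model checks each proposal in order.
        for t in draft:
            if target_step(tokens) == t:
                tokens.append(t)                    # proposal accepted
            else:
                tokens.append(target_step(tokens))  # fall back to target token
                break                               # discard remaining draft
    return tokens[len(prefix):len(prefix) + max_new]


# Hypothetical toy "models": the target counts up mod 10; the draft agrees
# except at every 5th position, mimicking an imperfect draft model.
def target_step(ctx):
    return (ctx[-1] + 1) % 10

def draft_step(ctx):
    return (ctx[-1] + 1) % 10 if len(ctx) % 5 else (ctx[-1] + 2) % 10

print(speculative_decode(target_step, draft_step, [0]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The speedup comes from the verify phase: in a real system the target scores all k draft tokens in one batched forward pass instead of k sequential ones, so every accepted draft token saves a full target-model step.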
Top Papers
- LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding (7.0)
LycheeDecode accelerates long-context LLM inference with a novel hybrid-head sparse decoding to significantly reduce memory and latency costs while maintaining high generative quality.
- LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing (7.0)
LycheeCluster accelerates long-context LLM inference by 3.6x using structure-aware chunking and hierarchical KV indexing, offering a drop-in replacement for existing KV cache management.
- ConFu: Contemplate the Future for Better Speculative Sampling (7.0)
ConFu enhances speculative decoding for LLMs by enabling draft models to anticipate future token generation, improving speed and accuracy.
- Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning (6.0)
Leverages the KV cache as a lightweight representation for efficient LLM inference, reducing computational cost without accuracy loss.
- More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD) (5.0)
Optimizes LLM inference at a fixed budget using the Reset-and-Discard (ReD) method.
- TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference (3.0)
TIDE enhances LLM inference with a serving-engine-native framework for online draft adaptation and speculative decoding.
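Several entries above (LycheeCluster, "Beyond Speedup") build on KV cache management. As background, and not drawn from any of these papers, here is a minimal single-head sketch of why decoders keep a KV cache at all: each step appends one new key/value pair and attends over the stored history, instead of recomputing keys and values for the whole sequence. All names here are illustrative.

```python
import math

class KVCache:
    """Toy single-head attention with a KV cache (no batching, no projections)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append the new token's key/value, then attend over the full history.
        # Cost per step is O(t) with the cache, vs O(t^2) if recomputed from scratch.
        self.keys.append(k)
        self.values.append(v)
        dot = lambda a, b: sum(x * y for x, y in zip(a, b))
        scores = [dot(q, key) / math.sqrt(len(q)) for key in self.keys]
        m = max(scores)                              # stabilized softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        return [sum(w * val[i] for w, val in zip(weights, self.values)) / z
                for i in range(len(v))]

cache = KVCache()
# With a single cached entry, attention trivially returns that entry's value.
print(cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0]))  # [2.0, 3.0]
# The second step reuses the stored key/value and blends both values.
print(cache.step([1.0, 0.0], [0.0, 1.0], [4.0, 5.0]))
```

Approaches like LycheeCluster's hierarchical KV indexing or hybrid-head sparse decoding restrict which cached entries each step attends to, which is what recovers speed when `self.keys` grows to long-context lengths.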