Papers
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While rece...
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightw...
More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a mor...
TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal...