Papers
LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While rece...
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightw...
More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a mor...
TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal...