State of the Field
Recent work in AI infrastructure focuses on memory management and computational efficiency for large language models (LLMs). Frameworks like BudgetMem enable query-aware memory routing, letting systems balance performance against cost, which matters for applications that need real-time responses. Innovations in caching, such as native position-independent caching, reduce latency and improve throughput by addressing the inefficiencies of traditional prefix-based systems. Algorithms like Qrita make sampling operations cheaper by minimizing computational overhead on GPUs, which is vital for scaling LLM serving. Beyond the technical gains, these developments promise faster, more cost-effective deployment of AI across industries, from customer service to content generation.
Papers
You Need an Encoder for Native Position-Independent Caching
The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been...
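To make the prefix-based limitation concrete, here is a minimal sketch (not the paper's system, and with a hypothetical `longest_cached_prefix` helper) of why a prefix-keyed KV cache gets zero reuse when the same retrieved chunks arrive in a different order:

```python
# Illustrative sketch: a prefix-based KV cache can only reuse entries
# when the new token sequence starts with a cached prefix, so reordering
# retrieved chunks defeats reuse entirely.

def longest_cached_prefix(cache: dict, tokens: list) -> int:
    """Return the length of the longest cached prefix of `tokens`."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

cache = {}
chunk_a, chunk_b = [1, 2, 3], [4, 5, 6]

# First request: context [A, B] — compute KV for all 6 tokens, then cache.
ctx1 = chunk_a + chunk_b
cache[tuple(ctx1)] = "kv-for-A-then-B"

# Second request: the same chunks, retrieved in the other order [B, A].
ctx2 = chunk_b + chunk_a
hit = longest_cached_prefix(cache, ctx2)
print(hit)  # 0 — no shared prefix, so nothing is reusable
```

Position-independent caching aims to reuse the per-chunk KV state regardless of where each chunk lands in the context.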
Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be ...
Qrita: High-performance Top-k and Top-p Algorithm for GPUs using Pivot-based Truncation and Selection
Top-k and Top-p are the dominant truncation operators in the sampling of large language models. Despite their widespread use, implementing them efficiently over large vocabularies remains a significan...
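For reference, the two truncation operators are straightforward to state on the CPU; a NumPy sketch follows (Qrita's contribution is doing this on GPUs without the full sort used here):

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Zero out all but the k largest probabilities, then renormalize."""
    out = np.zeros_like(probs)
    idx = np.argpartition(probs, -k)[-k:]   # indices of the k largest
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative mass reaches p."""
    order = np.argsort(probs)[::-1]          # descending by probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # tokens needed to reach p
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, 2))    # mass concentrated on the two largest
print(top_p_filter(probs, 0.8))  # keeps three tokens (0.5 + 0.25 + 0.15 >= 0.8)
```

Over vocabularies of 100k+ tokens, the `argsort` in `top_p_filter` dominates the cost; pivot-based truncation avoids ordering the entire distribution.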
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easie...
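The computation such a kernel implements is softmax(QKᵀ/√d)V. A plain NumPy reference (no tiling, no CUDA, single head) is useful as a ground truth when checking a kernel's output:

```python
import numpy as np

def sdpa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention for one head; q, k, v: (seq_len, head_dim)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v                             # (seq, head_dim) output

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = sdpa(q, k, v)
print(out.shape)  # (4, 8)
```

A tiled kernel produces the same result but processes Q, K, V in blocks sized to the GPU's shared memory, which is what makes the forward pass fast at long sequence lengths.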
Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers
Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-awa...
SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by token...