AI Infrastructure

6 papers
5.7 viability

State of the Field

Recent work in AI infrastructure focuses on optimizing memory management and computational efficiency for large language models (LLMs). New frameworks such as BudgetMem enable query-aware memory routing, allowing systems to balance performance against cost, which is crucial for applications that require real-time responses. Meanwhile, innovations in caching, such as native position-independent caching, significantly reduce latency and improve throughput by addressing the inefficiencies of traditional prefix-based KV caches. Algorithms like Qrita reduce the computational overhead of sampling operations on GPUs, which matters for scaling LLM serving. Beyond their technical merit, these developments enable faster, more cost-effective deployment of AI across industries, from customer service to content generation, broadening the practical utility of AI systems.
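To make the sampling side concrete: top-k and top-p (nucleus) truncation, the operators that work like Qrita accelerates, can be sketched naively in plain Python. This is an illustrative reference implementation only, not the pivot-based GPU algorithm from the paper, and the function names here are our own.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_top_p(logits, k, p):
    """Keep the k highest-probability tokens, then further truncate to the
    smallest prefix whose cumulative probability reaches p; renormalize."""
    probs = softmax(logits)
    # Token indices sorted by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:k]
    # Nucleus (top-p) cut within the top-k set.
    cumulative, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}

# Example: truncate a 4-token distribution to at most 3 tokens with p = 0.9.
dist = top_k_top_p([2.0, 1.0, 0.5, 0.1], k=3, p=0.9)
```

The expensive step at real vocabulary sizes (tens of thousands of tokens) is the full sort; pivot-based truncation avoids it, which is the inefficiency the Qrita paper targets.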

Last updated Feb 27, 2026

Papers

1–6 of 6
Research Paper·Feb 2, 2026

You Need an Encoder for Native Position-Independent Caching

The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been...

8.0 viability
Research Paper·Feb 5, 2026·B2B

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be ...

8.0 viability
Research Paper·Feb 2, 2026

Qrita: High-performance Top-k and Top-p Algorithm for GPUs using Pivot-based Truncation and Selection

Top-k and Top-p are the dominant truncation operators in the sampling of large language models. Despite their widespread use, implementing them efficiently over large vocabularies remains a significan...

6.0 viability
Research Paper·Mar 2, 2026

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easie...

6.0 viability
Research Paper·Jan 27, 2026

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-awa...

3.0 viability
Research Paper·Feb 26, 2026

SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by token...

3.0 viability