State of the Field
Recent advances in reinforcement learning (RL) increasingly focus on making large language models (LLMs) more efficient and adaptable in real-world applications. Techniques such as Just-In-Time Reinforcement Learning let LLM agents refine their policies during deployment without costly gradient updates, cutting operational expense. Methods such as contextual bandit learning for multi-turn code generation and checklist rewards for multi-step tool use bridge the gap between online and offline RL, making these systems more robust on complex tasks. Event-aware world models aim to improve sample efficiency by exploiting event segmentation, while regret-guided search control prioritizes high-impact states to learn more efficiently. Collectively, these developments signal a shift toward scalable, practical RL that can address diverse challenges, from automated coding to interactive AI agents, paving the way for broader deployment across sectors.
Papers
1–10 of 50
Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation
Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online...
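A contextual bandit treats each decision as a single-step choice conditioned on a context, with no long-horizon credit assignment. A minimal tabular epsilon-greedy sketch of the general idea (the class and action names are illustrative, and this is not the paper's algorithm):

```python
import random

class EpsilonGreedyBandit:
    """Tabular epsilon-greedy contextual bandit: one value estimate per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.values = {}   # (context, action) -> running mean reward
        self.counts = {}   # (context, action) -> number of updates
        self.rng = random.Random(seed)

    def select(self, context):
        # Explore with probability epsilon, otherwise pick the greedy action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, action, reward):
        # Incremental running mean of observed rewards for this (context, action).
        key = (context, action)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.values.get(key, 0.0)
        self.values[key] = mean + (reward - mean) / n
```

In a multi-turn setting, each turn's prompt would play the role of the context and each candidate completion the role of an action; how this paper actually bridges to offline RL is only hinted at in the preview above.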
Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challe...
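MaxEnt RL, which the abstract builds on, augments the reward with policy entropy; the resulting soft (log-sum-exp) state value and Boltzmann policy are textbook quantities, sketched below for a discrete action set (this is standard background, not the paper's flow-matching method):

```python
import math

def soft_value(q_values, alpha=1.0):
    """Soft state value used in MaxEnt RL:
    V(s) = alpha * log sum_a exp(Q(s, a) / alpha).
    As alpha -> 0 this approaches max_a Q(s, a)."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def boltzmann_policy(q_values, alpha=1.0):
    """MaxEnt-optimal policy: pi(a|s) proportional to exp(Q(s, a) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]
```

The temperature `alpha` trades off reward against entropy; with equal Q-values the policy above is uniform, and as `alpha` shrinks it concentrates on the greedy action.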
Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) ...
Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separa...
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance ...
Regret-Guided Search Control for Efficient Learning in AlphaZero
Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, human...
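Regret, in its simplest form, is the gap between the best available value at a state and the value of the action actually taken. A generic sketch of ranking visited states by that gap so high-regret decisions are revisited first (the helper names are hypothetical and not taken from the paper):

```python
def action_regret(q_values, chosen_action):
    """Regret of one decision: best achievable value minus the value of the action taken."""
    return max(q_values.values()) - q_values[chosen_action]

def prioritize_by_regret(decisions):
    """Order visited states so high-regret decisions come first.
    `decisions` maps state -> (q_values dict, chosen_action)."""
    return sorted(decisions,
                  key=lambda s: action_regret(*decisions[s]),
                  reverse=True)
```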
From Observations to Events: Event-Aware World Model for Reinforcement Learning
While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes an...
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing v...
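In RLVR, the reward is a binary signal from an automatic verifier rather than a learned reward model. A minimal sketch, assuming a hypothetical "Answer: <value>" extraction convention (this illustrates the RLVR reward shape, not the paper's Golden Goose construction):

```python
import re

def extract_answer(completion):
    """Pull the final answer out of a model completion.
    Assumes an 'Answer: <value>' convention, an illustrative choice."""
    match = re.search(r"Answer:\s*(\S+)", completion)
    return match.group(1) if match else None

def verifiable_reward(completion, reference):
    """RLVR-style binary reward: 1.0 iff the extracted answer matches the reference."""
    return 1.0 if extract_answer(completion) == reference else 0.0
```

The bottleneck the abstract describes is exactly the supply of `reference` values that a checker can verify, which most internet text does not come with.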
Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which empl...
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains ...
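A checklist reward, in its simplest generic form, scores a trajectory by the fraction of predicate checks it satisfies, giving denser feedback than a single end-of-episode success bit. A sketch with hypothetical checks (not CM2's actual rubric):

```python
def checklist_reward(trajectory, checks):
    """Score a multi-step tool-use trajectory as the fraction of checklist
    items satisfied. Each check is a predicate over the whole trajectory."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(trajectory))
    return passed / len(checks)
```

For example, a trajectory of tool-call strings can be scored against checks such as "called the search tool" or "stayed within a step budget", each contributing equally to the final reward.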