Reinforcement Learning

Trending
144 papers
4.5 viability
+100% (30d)

State of the Field

Recent reinforcement learning (RL) research is increasingly focused on making large language models (LLMs) more efficient and adaptable in real-world applications. Just-In-Time Reinforcement Learning lets LLM agents adapt their policies during deployment without gradient updates, avoiding the cost of retraining. Contextual bandit learning for multi-turn code generation bridges online and offline RL, while checklist rewards extend RL to multi-turn, multi-step agentic tool use. Event-aware world models aim to improve sample efficiency through event segmentation, and regret-guided search control improves learning efficiency in AlphaZero by prioritizing high-impact states. Together, these developments point toward more scalable, practical RL systems for commercial applications ranging from automated coding to interactive AI agents.

Last updated Feb 28, 2026

Papers

1–10 of 50
Research Paper·Feb 3, 2026·B2B·Education

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online...

8.0 viability
Research Paper·Feb 2, 2026

Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching

Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challe...

8.0 viability
Research Paper·Jan 26, 2026

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) ...

8.0 viability
Research Paper·Feb 3, 2026·B2B

Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separa...

7.0 viability
Research Paper·Jan 30, 2026

Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance ...

7.0 viability
Research Paper·Feb 24, 2026

Regret-Guided Search Control for Efficient Learning in AlphaZero

Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, human...

7.0 viability
Research Paper·Jan 27, 2026

From Observations to Events: Event-Aware World Model for Reinforcement Learning

While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes an...

7.0 viability
Research Paper·Jan 30, 2026

Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing v...

7.0 viability
Research Paper·Mar 2, 2026

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which empl...

7.0 viability
Research Paper·Feb 12, 2026

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains ...

7.0 viability