Recent advances in reinforcement learning (RL) increasingly focus on making large language models (LLMs) more efficient and adaptable in real-world applications. Techniques such as Just-In-Time Reinforcement Learning let LLM agents adapt their policies during deployment without costly gradient updates, substantially reducing operational expense. Methods such as contextual bandit learning for multi-turn code generation and checklist rewards for multi-step tool use bridge the gap between online and offline RL, making these systems more robust on complex tasks; a minimal bandit sketch follows below. Event-aware world models aim to improve sample efficiency by exploiting event segmentation, while regret-guided search control speeds up learning by prioritizing high-impact states. Collectively, these developments signal a shift toward more scalable, practical RL systems for diverse commercial challenges, from automated coding to interactive AI agents, paving the way for broader deployment across sectors.
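To make the contextual-bandit framing concrete, here is a minimal, generic sketch of epsilon-greedy contextual bandit learning of the kind such multi-turn code-generation methods build on. This is an illustration under stated assumptions, not the algorithm from the paper listed below: the context buckets, the "arms" (generation strategies), and the reward signal (e.g., unit-test pass/fail) are hypothetical placeholders.

```python
import random
from collections import defaultdict

class EpsilonGreedyContextualBandit:
    """Generic epsilon-greedy contextual bandit with incremental value estimates.

    Contexts are reduced to discrete buckets; each (bucket, arm) pair keeps a
    running mean of observed rewards.
    """

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # (bucket, arm) -> number of pulls
        self.values = defaultdict(float)  # (bucket, arm) -> mean reward

    def select(self, bucket):
        # Explore uniformly with probability epsilon; otherwise exploit the
        # arm with the highest estimated value for this context bucket.
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.values[(bucket, a)])

    def update(self, bucket, arm, reward):
        # Incremental mean update: V <- V + (r - V) / n.
        key = (bucket, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]

# Hypothetical usage: each arm is a generation strategy for a code-editing
# turn, the bucket is a coarse description of the task state, and the reward
# is whether the produced patch passes the tests (0 or 1).
bandit = EpsilonGreedyContextualBandit(n_arms=3, epsilon=0.1)
for _ in range(1000):
    bucket = random.choice(["tests_failing", "tests_passing"])  # fake contexts
    arm = bandit.select(bucket)
    reward = float(random.random() < (0.3 + 0.2 * arm))         # fake signal
    bandit.update(bucket, arm, reward)
```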
Top papers
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates (8.0)
- Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation (8.0)
- Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching (8.0)
- Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL (7.0)
- Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text (7.0)
- CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use (7.0)
- Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner (7.0)
- Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning (7.0)
- Regret-Guided Search Control for Efficient Learning in AlphaZero (7.0)
- From Observations to Events: Event-Aware World Model for Reinforcement Learning (7.0)
- Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment (7.0)
- Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery (6.0)
- EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization (6.0)
- LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting (6.0)
- Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction (6.0)
- Self-Hinting Language Models Enhance Reinforcement Learning (6.0)
- TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning (6.0)
- Agile Reinforcement Learning through Separable Neural Architecture (6.0)
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers (6.0)
- Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training (6.0)
- Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning (6.0)
- Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning (6.0)
- Formal Synthesis of Certifiably Robust Neural Lyapunov-Barrier Certificates (6.0)
- GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL (6.0)
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration (6.0)
- Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks (6.0)
- State-Action Inpainting Diffuser for Continuous Control with Delay (6.0)
- QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning (6.0)
- Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow (6.0)
- RUMAD: Reinforcement-Unifying Multi-Agent Debate (6.0)
- Zero-Shot Instruction Following in RL via Structured LTL Representations (6.0)
- Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning (6.0)
- Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models (6.0)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (6.0)
- VLM-Guided Experience Replay (6.0)
- AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search (6.0)
- Intrinsic Reward Policy Optimization for Sparse-Reward Environments (6.0)
- Bridging Dynamics Gaps via Diffusion Schrödinger Bridge for Cross-Domain Reinforcement Learning (6.0)
- LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models (6.0)
- Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning (6.0)
- Mode-Dependent Rectification for Stable PPO Training (6.0)
- IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning (6.0)
- Reward Learning through Ranking Mean Squared Error (6.0)
- Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes (5.0)
- General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies (5.0)
- Ranking-aware Reinforcement Learning for Ordinal Ranking (5.0)
- Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling (5.0)
- APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition (5.0)
- Beyond Imitation: Reinforcement Learning for Active Latent Planning (5.0)
- Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity (5.0)