Reinforcement Learning
Papers in Reinforcement Learning
26 papers
- AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search
  A reinforcement learning framework that improves multi-turn AI agents by combining tree search with turn-level policy optimization.
  Reinforcement Learning · Viability: 6.0
- On the Hidden Objective Biases of Group-based Reinforcement Learning
  A theoretical analysis of the hidden objective biases in group-based reinforcement learning methods (see the sketch below).
  Reinforcement Learning · Viability: 2.0
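For context on what "group-based" means here: GRPO-style methods score each sampled completion against statistics of its group's rewards. Below is a minimal sketch of that standard advantage computation (not this paper's analysis); the per-group standard-deviation division is one well-documented place where objective bias enters, since low-variance groups get their advantages inflated.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-normalized advantages in the style of GRPO.

    rewards: shape (G,), the rewards of G completions sampled for one prompt.
    Dividing by the group std is a known bias source: prompts whose groups
    have low reward variance get disproportionately large advantages.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt, binary correctness reward.
print(group_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # ~[ 1., -1., -1., 1.]
```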
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
  GDPO stabilizes multi-reward reinforcement learning by decoupling reward normalization across objectives, for stable and precise model training (see the sketch below).
  Reinforcement Learning · Viability: 4.0
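The title suggests the decoupling concerns where normalization happens: each reward channel is normalized within the group before the channels are combined, rather than normalizing an already-summed reward. A hedged sketch of that reading (function name, weighting, and combination rule are assumptions, not the paper's):

```python
import numpy as np

def decoupled_group_advantages(reward_channels: np.ndarray,
                               weights: np.ndarray,
                               eps: float = 1e-6) -> np.ndarray:
    """Per-channel group normalization, then weighted combination.

    reward_channels: shape (K, G), K reward signals for G completions.
    Normalizing the summed reward lets a high-variance channel dominate;
    normalizing each channel first keeps their scales comparable.
    """
    mean = reward_channels.mean(axis=1, keepdims=True)
    std = reward_channels.std(axis=1, keepdims=True) + eps
    return weights @ ((reward_channels - mean) / std)  # shape (G,)

channels = np.array([[0.9, 0.1, 0.5, 0.4],     # e.g. a helpfulness score in [0, 1]
                     [100., 98., 103., 99.]])  # e.g. a length reward on a large scale
print(decoupled_group_advantages(channels, weights=np.array([0.7, 0.3])))
```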
- Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
  Golden Goose synthesizes RLVR tasks from unverifiable internet text, enabling robust gains in diverse domains including cybersecurity.
  Reinforcement Learning · Viability: 7.0
- Reward Learning through Ranking Mean Squared Error
  A reward-learning method that fits a reward model with a ranking mean squared error (ranked return regression) objective, improving RL performance from minimal human feedback (see the sketch below).
  Reinforcement Learning · Viability: 6.0
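One plausible reading of "ranking mean squared error", sketched under that assumption (the target construction below is illustrative, not the paper's definition): regress the reward model's predicted trajectory returns onto targets derived only from the order of a human ranking.

```python
import numpy as np

def ranking_mse_loss(predicted_returns: np.ndarray,
                     ranking: np.ndarray) -> float:
    """MSE between predicted returns and rank-derived targets.

    ranking: trajectory indices ordered worst to best. Targets are the
    ranks rescaled to [0, 1], so only the ordering is trusted, never the
    magnitude of any human score.
    """
    targets = np.empty(len(ranking))
    targets[ranking] = np.linspace(0.0, 1.0, len(ranking))  # worst -> 0, best -> 1
    return float(np.mean((predicted_returns - targets) ** 2))

# Four trajectories; the annotator ranks trajectory 2 worst and 1 best.
print(ranking_mse_loss(np.array([0.4, 0.9, 0.1, 0.6]), ranking=np.array([2, 0, 3, 1])))
```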
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
  JitRL enables continual learning for LLM agents by adapting policies without gradient updates, drastically reducing computational cost.
  Reinforcement Learning · Viability: 8.0
- Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
  Sparse-RL reduces memory overhead in RL training for LLMs by using sparse rollouts, without performance loss.
  Reinforcement Learning · Viability: 3.0
- Projected Microbatch Accumulation yields reference-free proximal policy updates for reinforcement learning
  PROMA delivers reference-free proximal policy updates for reinforcement learning without entropy collapse (see the sketch below).
  Reinforcement Learning · Viability: 2.0
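For reference, the proximal update that "reference-free" is measured against: PPO's clipped surrogate, which requires storing the behavior policy's log-probabilities. A sketch of that standard baseline (not of PROMA itself):

```python
import numpy as np

def ppo_clip_objective(logp_new: np.ndarray, logp_old: np.ndarray,
                       advantages: np.ndarray, clip_eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate (to be maximized).

    The logp_old term is the reference bookkeeping that reference-free
    schemes aim to eliminate while keeping updates proximal.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

print(ppo_clip_objective(np.array([-0.9, -1.2]), np.array([-1.0, -1.0]),
                         advantages=np.array([1.0, -0.5])))
```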
- Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
  A hierarchical policy that autoregressively generates chains of subgoals for long-horizon offline goal-conditioned RL.
  Reinforcement Learning · Viability: 7.0
- SuS: Strategy-aware Surprise for Intrinsic Exploration
  An intrinsic-motivation framework that uses strategy-aware surprise signals to improve exploration in reinforcement learning (see the sketch below).
  Reinforcement Learning · Viability: 5.0
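"Surprise" in intrinsic exploration is most commonly a dynamics-model prediction error; a minimal sketch of that generic baseline (the strategy-aware conditioning that distinguishes SuS is not modeled here):

```python
import numpy as np

def surprise_bonus(predicted_next_state: np.ndarray,
                   next_state: np.ndarray, scale: float = 1.0) -> float:
    """Intrinsic reward proportional to the dynamics model's prediction error:
    transitions the model fails to predict are 'surprising' and worth revisiting."""
    return scale * float(np.sum((predicted_next_state - next_state) ** 2))

# The agent's total reward would be r_env + surprise_bonus(model(s, a), s_next).
print(surprise_bonus(np.array([0.1, 0.2]), np.array([0.3, 0.2])))
```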
- Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals
  Pre-trains adaptive RL policies through unsupervised, self-imposed goal-setting for diverse environments.
  Reinforcement Learning · Viability: 4.0
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
  Leverages privileged on-policy exploration to enhance language-model reasoning with oracle-guided RL.
  Reinforcement Learning · Viability: 6.0
- Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
  FLAME accelerates maximum-entropy RL with one-step flow-matching policies, delivering efficient, low-latency action sampling (objective recalled below).
  Reinforcement Learning · Viability: 8.0
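For reference, the maximum-entropy objective being boosted is the standard soft-RL one, $J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[\sum_t r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big]$, where the temperature $\alpha$ trades expected return against policy entropy; one-step flow matching plausibly targets the cost of sampling from the resulting multimodal policy.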
- Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes
  PrefixRL improves reinforcement learning efficiency on hard problems by conditioning on very off-policy traces.
  Reinforcement Learning · Viability: 5.0
- Q-learning with Adjoint Matching
  A reinforcement learning algorithm that uses adjoint matching to optimize flow-based policies without unstable backpropagation.
  Reinforcement Learning · Viability: 2.0
- Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning
  A benchmark and codebase for developing RL agents with enhanced memory-updating abilities.
  Reinforcement Learning · Viability: 6.0
- CoScale-RL: Efficient Post-Training by Co-Scaling Data and Computation
  Efficiently improves Large Reasoning Model performance with CoScale-RL's data-and-computation co-scaling strategy.
  Reinforcement Learning · Viability: 5.0
- Proximal Policy Optimization with Evolutionary Mutations
  Integrates evolutionary mutations into PPO for improved exploration in reinforcement learning (see the sketch below).
  Reinforcement Learning · Viability: 4.0
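A generic sketch of the idea the title suggests (the mutation scheme and schedule here are assumptions, not the paper's): between gradient phases, Gaussian-perturb the policy parameters, evaluate each mutant, and continue PPO from the best-scoring variant.

```python
import numpy as np

def mutate_and_select(params: np.ndarray, evaluate, n_mutants: int = 8,
                      sigma: float = 0.02, seed: int = 0) -> np.ndarray:
    """One evolutionary exploration step: Gaussian mutations plus elitist
    selection (the unmutated parent competes too, so the score never
    regresses under the evaluation used for selection)."""
    rng = np.random.default_rng(seed)
    candidates = [params] + [params + sigma * rng.standard_normal(params.shape)
                             for _ in range(n_mutants)]
    scores = [evaluate(p) for p in candidates]  # evaluate = policy rollouts
    return candidates[int(np.argmax(scores))]

# Toy stand-in objective; in practice evaluate() would average episode returns.
print(mutate_and_select(np.zeros(3), evaluate=lambda p: -np.sum((p - 1.0) ** 2)))
```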
- Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
  Shows how outcome-based RL enables Transformers to develop reasoning, but only with the right data; commercial applicability remains limited.
  Reinforcement Learning · Viability: 3.0
- Safe Continual Reinforcement Learning Methods for Nonstationary Environments: Towards a Survey of the State of the Art
  A comprehensive survey of safe continual reinforcement learning methods for nonstationary environments.
  Reinforcement Learning · Viability: 2.0
- Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback
  Makes preference-based reinforcement learning robust to unreliable annotators by optimizing per-expert trust parameters (see the sketch below).
  Reinforcement Learning · Viability: 2.0
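The "trust, don't trust, or flip" trichotomy suggests a per-annotator trust parameter that can shrink to zero or go negative (flipping that annotator's labels). A hedged sketch of one such weighted preference likelihood (the functional form is an assumption, not the paper's model):

```python
import numpy as np

def trusted_preference_loglik(r_preferred: float, r_rejected: float,
                              trust: float) -> float:
    """Bradley-Terry log-likelihood scaled by a per-annotator trust parameter.

    trust > 0 believes the label, trust ~ 0 ignores it, and trust < 0
    effectively flips it.
    """
    margin = r_preferred - r_rejected
    return -float(np.log1p(np.exp(-trust * margin)))  # log sigmoid(trust * margin)

print(trusted_preference_loglik(r_preferred=1.2, r_rejected=0.3, trust=1.0))
print(trusted_preference_loglik(r_preferred=1.2, r_rejected=0.3, trust=-1.0))
```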
- Diffusion Model-based Reinforcement Learning for Version Age of Information Scheduling: Average and Tail-Risk-Sensitive Control
  Develops a diffusion-based reinforcement learning approach to optimize Version Age of Information in real-time wireless systems.
  Reinforcement Learning · Viability: 3.0
- From Observations to Events: Event-Aware World Model for Reinforcement Learning
  A reinforcement learning framework that transforms sensory streams into event-driven representations for more efficient policy learning.
  Reinforcement Learning · Viability: 7.0
- TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning
  TACLer is a tailored curriculum reinforcement learning framework for LLM reasoning that improves accuracy while significantly reducing computational cost.
  Reinforcement Learning · Viability: 6.0
- Beyond Imitation: Reinforcement Learning for Active Latent Planning
  ATP-Latent enhances latent reasoning in LLMs by using reinforcement learning for active latent planning.
  Reinforcement Learning · Viability: 5.0
- APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition
  Adaptive Policy Composition (APC) improves reinforcement learning by dynamically composing suboptimal data-driven behavior priors and exceeding them.
  Reinforcement Learning · Viability: 5.0