
BUILDER'S SANDBOX

Core Pattern

AI-generated implementation pattern based on this paper's core methodology.


Estimated build cost: $9K-$13K over 6-10 weeks.


Founder's Pitch

"A novel gradient regularization technique to prevent reward hacking in reinforcement learning models."

Reinforcement Learning · Score: 2
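
The pitch above is the only methodological detail shown on this page. For orientation, here is a minimal sketch of what gradient regularization commonly looks like in a policy-gradient setting: a squared gradient-norm penalty added to the surrogate loss, which damps sharp, high-curvature updates of the kind associated with reward hacking. This is a generic illustration under assumed names (`policy`, `log_probs`, `advantages`, `lam`), not the paper's actual method.

```python
import torch
import torch.nn as nn

def regularized_policy_loss(policy: nn.Module,
                            log_probs: torch.Tensor,
                            advantages: torch.Tensor,
                            lam: float = 0.01) -> torch.Tensor:
    """Policy-gradient surrogate plus a gradient-norm penalty (sketch only).

    `log_probs` must have been computed from `policy` with autograd enabled,
    so the penalty can differentiate back to the policy parameters.
    """
    # Standard surrogate: -E[log pi(a|s) * A(s, a)]
    pg_loss = -(log_probs * advantages).mean()

    # Penalty: lam * ||d pg_loss / d theta||^2.
    # create_graph=True keeps the backward graph so the penalty itself is
    # differentiable; allow_unused handles parameters outside this loss.
    params = [p for p in policy.parameters() if p.requires_grad]
    grads = torch.autograd.grad(pg_loss, params,
                                create_graph=True, allow_unused=True)
    grad_norm_sq = sum(g.pow(2).sum() for g in grads if g is not None)

    return pg_loss + lam * grad_norm_sq
```

The double backward roughly doubles per-step compute, and reference [7] below ("When Will Gradient Regularization Be Harmful?") is a reminder that the penalty weight `lam` needs careful tuning.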

Commercial Viability Breakdown (0-10 scale)

High Potential:     0/4 signals · score 0
Quick Build:        0/4 signals · score 0
Series A Potential: 0/4 signals · score 0


References (36)

[1] Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback. Johannes Ackermann, Takashi Ishida et al., 2025.
[2] One Token to Fool LLM-as-a-Judge. Yulai Zhao, Haolin Liu et al., 2025.
[3] ReDit: Reward Dithering for Improved LLM Policy Optimization. Chenxing Wei, Jiarui Yu et al., 2025.
[4] Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning. Hyun-Kyu Lee, Sungmin Yoon, 2025.
[5] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Adam Suma, Samuel Dauncey, 2025.
[6] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization. Noam Razin, Sadhika Malladi et al., 2024.
[7] When Will Gradient Regularization Be Harmful? Yang Zhao, Hao Zhang et al., 2024.
[8] Understanding the Performance Gap Between Online and Offline Alignment Algorithms. Yunhao Tang, D. Guo et al., 2024.
[9] Learn Your Reference Model for Real Good Alignment. Alexey Gorbatovski, Boris Shaposhnikov et al., 2024.
[10] The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization. Shengyi Huang, Michael Noukhovitch et al., 2024.
[11] Teaching Large Language Models to Reason with Reinforcement Learning. Alex Havrilla, Yuqing Du et al., 2024.
[12] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang et al., 2024.
[13] KTO: Model Alignment as Prospect Theoretic Optimization. Kawin Ethayarajh, Winnie Xu et al., 2024.
[14] Language Model Alignment with Elastic Reset. Michael Noukhovitch, Samuel Lavoie et al., 2023.
[15] PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning. Hojoon Lee, Hanseul Cho et al., 2023.
[16] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Lianmin Zheng, Wei-Lin Chiang et al., 2023.
[17] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafael Rafailov, Archit Sharma et al., 2023.
[18] AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. Yann Dubois, Xuechen Li et al., 2023.
[19] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. Stella Biderman, Hailey Schoelkopf et al., 2023.
[20] Scaling Laws for Reward Model Overoptimization. Leo Gao, John Schulman et al., 2022.

Showing 20 of 36 references