A comprehensive study compares Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) for large language model (LLM) alignment. It finds that a properly tuned PPO consistently achieves stronger and more robust alignment across diverse tasks, whereas DPO is prone to exploiting out-of-distribution responses.
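For context, DPO optimizes the standard pairwise preference loss over chosen and rejected responses. The sketch below is a minimal illustration of that loss, assuming per-sequence log-probabilities under the policy and the frozen reference model have already been computed; the function and variable names are chosen for clarity and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from per-sequence log-probabilities.

    Each argument is a (batch,) tensor holding the summed log-probability
    of the chosen / rejected response under the policy or the frozen
    reference model. `beta` controls the implicit KL-penalty strength.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO maximizes the margin between chosen and rejected log-ratios.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```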
This work by Gao et al. systematically investigates the use of reward models in reinforcement learning for large language model reasoning, identifying a severe 'reward hacking' issue with process-supervised reward models. The study introduces Clip and Delta mechanisms to mitigate this problem, yielding consistent performance improvements across various LLMs, including the state-of-the-art Qwen2.5-Math-7B-Instruct, with significant gains in sampling accuracy on mathematical reasoning benchmarks such as MATH and GSM8K.
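The summary does not spell out the two mechanisms, so the sketch below is one plausible reading rather than the paper's exact formulation: it assumes Clip upper-bounds each per-step process reward at a threshold (so the return cannot be inflated by emitting many redundant high-reward steps) and Delta replaces each step reward with the difference to the next step's reward (so the cumulative return telescopes and stays bounded). The function name and `threshold` parameter are illustrative.

```python
import torch

def refine_process_rewards(step_rewards, mode="clip", threshold=0.0):
    """Hedged sketch of reward refinement for process-supervised RL.

    `step_rewards` is a (num_steps,) tensor of per-step reward-model scores
    for one sampled solution. Both branches implement assumed readings of
    the Clip and Delta mechanisms, not the paper's exact definitions.
    """
    if mode == "clip":
        # Upper-bound every step reward at `threshold`.
        return torch.clamp(step_rewards, max=threshold)
    if mode == "delta":
        # r_t <- r_t - r_{t+1}; the final step keeps its own reward,
        # so the sum over steps telescopes to the first step's reward.
        shifted = torch.cat([step_rewards[1:], step_rewards.new_zeros(1)])
        return step_rewards - shifted
    raise ValueError(f"unknown mode: {mode}")
```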