Wenge Technology
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

An empirical investigation demonstrates that iterative Direct Preference Optimization (DPO) can enhance Large Language Model (LLM) reasoning to levels comparable with state-of-the-art Reinforcement Learning (RL) methods, while substantially reducing computational costs by enabling training on a single 80GB GPU. The study also proposes an iterative framework for the mutual refinement of the LLM generator and its associated reward model.
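For context, the standard DPO objective that such iterative training builds on can be sketched as follows. This is a minimal single-pair illustration, not the paper's implementation; the function name, argument layout, and the choice of beta are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response
    under the trainable policy (logp_*) or the frozen reference
    model (ref_logp_*); beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the rejected one, relative to the
    # reference model's preference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability mass toward the chosen response. An iterative scheme like the one described above would regenerate preference pairs with the updated policy and repeat this optimization.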
