Wenge Technology
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

An empirical investigation demonstrates that iterative Direct Preference Optimization (DPO) can enhance Large Language Model (LLM) reasoning to levels comparable with state-of-the-art Reinforcement Learning (RL) methods, while substantially reducing computational costs by enabling training on a single 80GB GPU. The study also proposes an iterative framework for the mutual refinement of the LLM generator and its associated reward model.
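For context, the standard DPO objective that such iterative training builds on can be sketched as follows. This is a minimal single-pair illustration, not the paper's implementation; the function name, argument layout, and the choice of beta are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response
    under the trainable policy (logp_*) or the frozen reference
    model (ref_logp_*); beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the rejected one, relative to the
    # reference model's preference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability mass toward the chosen response. An iterative scheme like the one described above would regenerate preference pairs with the updated policy and repeat this optimization.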
