Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Ivison, Hamish; Wang, Yizhong; Liu, Jiacheng; Wu, Zeqiu; Pyatkin, Valentina; Lambert, Nathan; Smith, Noah A.; Choi, Yejin; Hajishirzi, Hannaneh

arXiv:2406.09279 (cs)

[Submitted on 13 Jun 2024 (v1), last revised 7 Oct 2024 (this version, v2)]

Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning from preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories.
We publicly release the code used for training (this https URL) and evaluating (this https URL) our models, along with the models and datasets themselves (this https URL).

Comments:	NeurIPS 2024 camera-ready
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.09279 [cs.CL]
	(or arXiv:2406.09279v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.09279

Submission history

From: Hamish Ivison [view email]
[v1] Thu, 13 Jun 2024 16:17:21 UTC (430 KB)
[v2] Mon, 7 Oct 2024 21:24:59 UTC (8,938 KB)
