Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Miao, Yanting; Loh, William; Kothawade, Suraj; Poupart, Pascal; Rashwan, Abdullah; Li, Yeqing

arXiv:2407.12164 (cs)

[Submitted on 16 Jul 2024 (v1), last revised 31 Oct 2024 (this version, v2)]

Abstract: Text-to-image generative models have recently attracted considerable interest, enabling the synthesis of high-quality images from textual prompts. However, these models often cannot generate specific subjects from given reference images or synthesize novel renditions of those subjects under varying conditions. Methods such as DreamBooth and Subject-driven Text-to-Image (SuTI) have made significant progress in this area. Yet both approaches focus primarily on enhancing similarity to the reference images and require expensive setups, often overlooking the need for efficient training and for avoiding overfitting to the reference images. In this work, we present the $\lambda$-Harmonic reward function, which provides a reliable reward signal and enables early stopping for faster training and effective regularization. Combined with the Bradley-Terry preference model, the $\lambda$-Harmonic reward function also provides preference labels for subject-driven generation tasks. We propose Reward Preference Optimization (RPO), which offers a simpler setup (requiring only $3\%$ of the negative samples used by DreamBooth) and needs fewer gradient steps for fine-tuning. Unlike most existing methods, our approach does not require training a text encoder or optimizing text embeddings; it achieves text-image alignment by fine-tuning only the U-Net component. Empirically, $\lambda$-Harmonic proves to be a reliable approach for model selection in subject-driven generation tasks. Using preference labels and early-stopping validation from the $\lambda$-Harmonic reward function, our algorithm achieves a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314 on DreamBench.
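
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the $\lambda$-Harmonic reward, so the weighted harmonic mean of an image-similarity score and a text-alignment score used below is an assumption suggested by the name, not the authors' stated definition. The function names, the default $\lambda$, and the example scores are hypothetical. The Bradley-Terry probability is the standard form, in which the probability of preferring one sample over another is the sigmoid of their reward difference.

```python
import math


def harmonic_reward(image_sim: float, text_sim: float, lam: float = 0.5) -> float:
    """Weighted harmonic mean of two positive scores.

    ASSUMPTION: this form is inferred from the name "lambda-Harmonic";
    the abstract does not state the actual definition.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if image_sim <= 0.0 or text_sim <= 0.0:
        raise ValueError("scores must be positive")
    return 1.0 / (lam / image_sim + (1.0 - lam) / text_sim)


def bradley_terry_preference(reward_a: float, reward_b: float) -> float:
    """Standard Bradley-Terry probability that sample a is preferred over b."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


# Hypothetical scores for two generated images under the same prompt.
r_a = harmonic_reward(image_sim=0.8, text_sim=0.3)
r_b = harmonic_reward(image_sim=0.6, text_sim=0.3)
p_a_over_b = bradley_terry_preference(r_a, r_b)

# A preference label can then be read off by thresholding at 0.5.
label = 1 if p_a_over_b > 0.5 else 0
```

A harmonic mean is pulled toward the smaller of its two inputs, so under this assumed form a sample cannot score well by matching the reference images while ignoring the prompt, or the reverse.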

Comments:	NeurIPS 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2407.12164 [cs.CV]
	(or arXiv:2407.12164v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.12164

Submission history

From: Yanting Miao
[v1] Tue, 16 Jul 2024 20:40:25 UTC (28,921 KB)
[v2] Thu, 31 Oct 2024 03:03:47 UTC (24,211 KB)
