Generating Synthetic Datasets by Interpolating along Generalized Geodesics

Fan, Jiaojiao; Alvarez-Melis, David

arXiv:2306.06866 (cs)

[Submitted on 12 Jun 2023]

Abstract: Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain, where the model will ultimately be used, is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as 'combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and, notably, can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate that this is a promising new approach for targeted on-demand dataset synthesis.
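
As a rough illustration of the interpolation idea in the abstract, the sketch below builds a generalized geodesic between small point clouds: each point of a base cloud is sent to the weighted average of its optimal transport images in the other clouds. This is not the authors' implementation. It assumes equal-size, uniformly weighted clouds, a squared Euclidean cost on features only (labels and the labeled-dataset distance are ignored), and a brute-force search over permutations that is only viable for a handful of points. The function names `ot_map_bruteforce` and `generalized_geodesic` are hypothetical.

```python
from itertools import permutations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ot_map_bruteforce(base, target):
    """Exact OT map between two equal-size, uniformly weighted point clouds.

    Assumption for illustration: with uniform weights and equal sizes, an
    optimal coupling can be taken to be a permutation, so we simply try all
    of them. Returns a list whose i-th entry is the target point matched
    to base[i].
    """
    n = len(base)
    assert len(target) == n, "clouds must have the same number of points"
    best_perm = min(
        permutations(range(n)),
        key=lambda perm: sum(sq_dist(base[i], target[perm[i]]) for i in range(n)),
    )
    return [target[j] for j in best_perm]


def generalized_geodesic(base, datasets, weights):
    """Interpolate between `datasets` with respect to the `base` cloud.

    Each base point is mapped to the convex combination (given by
    `weights`) of its OT images in every dataset.
    """
    assert len(datasets) == len(weights)
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    images = [ot_map_bruteforce(base, d) for d in datasets]
    dim = len(base[0])
    return [
        tuple(sum(w * img[i][k] for w, img in zip(weights, images)) for k in range(dim))
        for i in range(len(base))
    ]


# Toy data: `left` and `up` are shuffled translations of `base`.
base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
left = [(-2.0, 1.0), (-2.0, 0.0), (-1.0, 0.0)]
up = [(0.0, 3.0), (1.0, 2.0), (0.0, 2.0)]

mix = generalized_geodesic(base, [left, up], [0.5, 0.5])
print(mix)
```

With weights `[1.0, 0.0]` the result is the `left` cloud reordered to match `base`; intermediate weights blend the two shapes. For realistic dataset sizes, the brute-force matching would have to be replaced by a proper OT solver or a learned transport map.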

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.06866 [cs.LG]
	(or arXiv:2306.06866v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.06866
Journal reference:	Conference on Uncertainty in Artificial Intelligence (UAI) 2023

Submission history

From: Jiaojiao Fan
[v1] Mon, 12 Jun 2023 04:46:44 UTC (7,002 KB)
