AudioPaLM: A Large Language Model That Can Speak and Listen

Rubenstein, Paul K.; Asawaroengchai, Chulayuth; Nguyen, Duc Dung; Bapna, Ankur; Borsos, Zalán; Quitry, Félix de Chaumont; Chen, Peter; Badawy, Dalia El; Han, Wei; Kharitonov, Eugene; Muckenhirn, Hannah; Padfield, Dirk; Qin, James; Rozenberg, Danny; Sainath, Tara; Schalkwyk, Johan; Sharifi, Matt; Ramanovich, Michelle Tadmor; Tagliasacchi, Marco; Tudor, Alexandru; Velimirović, Mihajlo; Vincent, Damien; Yu, Jiahui; Wang, Yongqiang; Zayats, Vicky; Zeghidour, Neil; Zhang, Yu; Zhang, Zhishuai; Zilka, Lukas; Frank, Christian

arXiv:2306.12925 (cs)

[Submitted on 22 Jun 2023]

Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at this https URL

Comments:	Technical report
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Cite as:	arXiv:2306.12925 [cs.CL]
	(or arXiv:2306.12925v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.12925

Submission history

From: Paul Rubenstein
[v1] Thu, 22 Jun 2023 14:37:54 UTC (135 KB)
