
Pushing the frontiers of audio generation

By softbliss, April 30, 2025, in Artificial Intelligence

Technologies

Published: 30 October 2024
Authors: Zalán Borsos, Matt Sharifi and Marco Tagliasacchi


An illustration depicting speech patterns, iterative progress on dialogue generation, and a relaxed conversation between two voices.

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology built for generating natural, dynamic voices continues to improve, we’re unlocking richer, more engaging digital experiences.

Over the past few years, we’ve been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, tempo controls and particular voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube’s auto dubbing, and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working together with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we’ve been investing in audio generation research and exploring new ways to generate more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input, without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
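To make "acoustic tokens" more concrete, here is a minimal sketch of residual vector quantization, the quantization scheme described in the published SoundStream paper. The two tiny codebooks, the 2-dimensional frame and the function names are toy assumptions chosen for illustration; a real codec learns its codebooks during training and operates on encoder outputs, not raw numbers like these.

```python
# Toy residual vector quantization (RVQ). All values are hypothetical.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec with a cascade of codebooks; each stage encodes
    whatever the previous stages left unexplained (the residual)."""
    tokens = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(codebooks, tokens):
    """Sum the selected entry from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Assumed toy codebooks: a coarse stage followed by a finer stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]

frame = [1.2, 0.3]                       # one hypothetical encoder output
tokens = rvq_encode(CODEBOOKS, frame)    # one token per stage
reconstruction = rvq_decode(CODEBOOKS, tokens)
print(tokens, reconstruction)
```

Each frame becomes a short list of integers, one per stage, and adding more stages refines the reconstruction at the cost of more tokens.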

AudioLM treats audio generation as a language modeling task to produce the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments — making it a good candidate for modeling multi-speaker dialogues.
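The idea of treating token sequences as a language can be sketched with a deliberately simple stand-in. The bigram counter below is not AudioLM, which is a Transformer-based model; it only illustrates that a next-token model sees integer IDs and needs no knowledge of what kind of audio they encode. The training sequences and function names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the counts for one context token into a probability table."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

# Hypothetical token sequences, such as a codec might emit.
training = [[3, 7, 7, 2], [3, 7, 2, 2], [3, 3, 7, 2]]
model = fit_bigram(training)
probs = next_token_probs(model, 7)
print(probs)
```

The same code would run unchanged on tokens from speech, music or background noise, which is the property the paragraph above describes.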

Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Building upon this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
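The speed-up figure follows directly from the two numbers quoted above. As a quick check, using a hypothetical helper name:

```python
def real_time_factor(audio_seconds, compute_seconds):
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / compute_seconds

# 2 minutes of dialogue generated in 3 seconds (the quoted upper bound).
rtf = real_time_factor(audio_seconds=2 * 60, compute_seconds=3)
print(rtf)
```

Because the quoted generation time is an upper bound ("under 3 seconds"), 40 is a lower bound on the speed-up.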

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec for compressing audio into a sequence of tokens, at bitrates as low as 600 bits per second, without compromising the quality of its output.

The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
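One way to picture this layout is as a list of frames, each holding a few tokens ordered from coarse to fine. The sketch below is illustrative only: the number of levels per frame and the token values are assumptions, since the text does not specify them.

```python
def flatten(frames):
    """Serialize frames into one sequence, coarse tokens first in each frame."""
    return [tok for frame in frames for tok in frame]

def unflatten(tokens, levels):
    """Regroup a flat sequence into frames of `levels` tokens each."""
    if len(tokens) % levels != 0:
        raise ValueError("sequence length must be a multiple of levels")
    return [tokens[i:i + levels] for i in range(0, len(tokens), levels)]

def coarse_only(frames, keep):
    """Keep only the first `keep` tokens of every frame."""
    return [frame[:keep] for frame in frames]

# Three hypothetical time frames with three levels each.
frames = [[12, 5, 9], [12, 6, 1], [40, 2, 7]]
flat = flatten(frames)
print(flat)
print(coarse_only(frames, keep=1))
```

Dropping the later tokens in each frame, as `coarse_only` does, keeps the part of the code described above as phonetic and prosodic and discards the fine acoustic detail.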

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
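A back-of-the-envelope calculation shows what these figures imply together. It assumes the 2-minute dialogue is coded at the lowest quoted bitrate of 600 bits per second and uses 5000 as the token count; the text gives both numbers only as bounds, so the results are rough estimates, not published specifications.

```python
BITRATE_BPS = 600        # lowest quoted codec bitrate, bits per second
DURATION_S = 2 * 60      # a 2-minute dialogue
MIN_TOKENS = 5000        # quoted lower bound on tokens to generate

total_bits = BITRATE_BPS * DURATION_S         # bits for the whole dialogue
tokens_per_second = MIN_TOKENS / DURATION_S   # minimum token rate
bits_per_token = total_bits / MIN_TOKENS      # average bits per token, at most

print(total_bits, round(tokens_per_second, 1), bits_per_token)
```

Under these assumptions the model has to emit at least about 42 tokens for every second of audio, which is why sequence length becomes the main modeling challenge.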

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue, within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.

Animation showing how our speech generation model produces a stream of audio tokens autoregressively, which are decoded back to a waveform consisting of a two-speaker dialogue.
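The generate-then-decode flow can be sketched as a simple loop. Everything here is a toy stand-in: `toy_next_token` replaces the trained Transformer with a deterministic rule, and `toy_decode` replaces the codec decoder with a linear mapping from token IDs to sample values. Only the control flow, appending one token at a time and decoding at the end, mirrors the description above.

```python
VOCAB_SIZE = 8  # assumed toy vocabulary size

def toy_next_token(context):
    """Hypothetical stand-in for a trained model: the next token is a
    deterministic function of everything generated so far."""
    return (sum(context) + len(context)) % VOCAB_SIZE

def generate(prompt, num_new_tokens):
    """Autoregressive loop: each new token is conditioned on all prior ones."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(toy_next_token(tokens))
    return tokens

def toy_decode(tokens):
    """Hypothetical stand-in for a codec decoder: map each token ID to a
    sample value in the range [-1, 1]."""
    return [2 * t / (VOCAB_SIZE - 1) - 1 for t in tokens]

tokens = generate(prompt=[1, 2], num_new_tokens=4)
waveform = toy_decode(tokens)
print(tokens)
```

In a real system the prompt would carry the script and speaker turn markers, and sampling would typically be stochastic, not deterministic as in this sketch.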

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. We then finetuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations. This dataset consists of unscripted conversations from a number of voice actors and includes realistic disfluencies, the “umm”s and “aah”s of real conversation. This step taught the model to switch reliably between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we’re incorporating our SynthID technology to watermark non-transient AI-generated audio content from these models, to help safeguard against the potential misuse of this technology.

New speech experiences ahead

We’re now focused on improving our model’s fluency and acoustic quality and on adding more fine-grained controls for features such as prosody, while exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technologies.

Acknowledgements

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong and RJ Skerry-Ryan for their critical efforts on dialogue data.

We’re very grateful to our collaborators across Labs, Illuminate, Cloud, Speech and YouTube for their outstanding work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.

