# Audio Intelligence

# Overview

This repository contains implementations of several state-of-the-art audio intelligence research projects from NVIDIA.

# Projects

# Audio Understanding, Generation, and Reasoning

### UALM (Audio Understanding and Generation)

**UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning (ICLR 2026 oral)**
UALM is an advanced audio-language model that **unifies** text and audio tasks, including text problem solving, audio understanding, text-to-audio generation, and multimodal reasoning across modalities. UALM matches the quality of state-of-the-art specialized models on each task, and is the first demonstration of cross-modal generative reasoning in the audio research domain.

---

### Music Flamingo (Music Understanding)

**Music Flamingo: Scaling Music Understanding in Audio Language Models (ICLR 2026)**
Music Flamingo is a fully open, state-of-the-art large audio-language model designed to advance music (including song) understanding in foundational audio models.

---

### Audio Flamingo 3 (Audio Understanding)

**Advancing Audio Intelligence with Fully Open Large Audio Language Models (NeurIPS 2025)**

Audio Flamingo 3 is a 7B audio language model using the [LLaVA](https://arxiv.org/abs/2304.08485) architecture for audio understanding. We trained our unified AF-Whisper audio encoder, based on [Whisper](https://arxiv.org/abs/2212.04356), to handle understanding beyond speech recognition. We included speech-related tasks in Audio Flamingo 3 and scaled the training dataset up to about 50M audio-text pairs. As a result, Audio Flamingo 3 handles all three audio modalities: **sound**, **music**, and **speech**. It outperforms prior SOTA models on a number of understanding and reasoning benchmarks. Audio Flamingo 3 accepts audio inputs up to 10 minutes long and includes a streaming TTS module (AF3-Chat) for voice output.

---

### ETTA (Audio Generation)

**Elucidating the Design Space of Text-to-Audio Models (ICML 2025)**

**Improving Text-To-Audio Models with Synthetic Captions**
ETTA is a 1.4B latent diffusion model for text-to-audio generation. We trained ETTA on over 1M synthetic captions annotated by Audio Flamingo, and showed that this approach leads to high-quality audio generation as well as emergent abilities with scale.

---

### Fugatto 1 (Audio Editing and Generation)

**Foundational Generative Audio Transformer Opus 1 (ICLR 2025)**

Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.

---

### DCASE 2025 Challenge Task 5 (Audio Understanding Challenge)

**Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in the DCASE 2025 Challenge**
---

### TangoFlux (Audio Generation)

**TangoFlux: Super Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization (ICLR 2026)**

TangoFlux is an efficient, high-quality text-to-audio model built on a FluxTransformer with CLAP-ranked preference optimization. This project was a collaboration with SUTD and Lambda Labs.

---

### OMCAT (Audio-Visual Understanding)

**Omni Context Aware Transformer**
OMCAT is an audio-visual understanding model featuring RoTE (Rotary Time Embeddings).
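The README does not detail RoTE; the sketch below illustrates the general idea of rotary embeddings keyed to absolute timestamps rather than token indices, so that temporally aligned audio and video tokens receive the same rotation. The function name `apply_rote` and all details are illustrative assumptions, not OMCAT's actual implementation.

```python
import numpy as np

def apply_rote(x, t, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to
    timestamp t (in seconds), analogous to rotary position embeddings.

    x: 1-D feature vector with an even dimension; t: scalar time.
    Because each pair is rotated by t * inv_freq, dot products between
    rotated queries and keys depend only on the time difference t_q - t_k,
    giving relative-time awareness across modalities.
    """
    dim = x.shape[0]
    assert dim % 2 == 0, "feature dimension must be even"
    # Per-pair inverse frequencies, as in standard rotary embeddings.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = t * inv_freq                      # one angle per feature pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # pair components
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```

Since each pair undergoes a pure rotation, the embedding preserves vector norms, and shifting both timestamps by the same offset leaves query-key dot products unchanged.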

---

# Representation Learning

### UniWav (Speech Codec)

**Towards Unified Pre-training for Speech Representation Learning and Generation (ICLR 2025)**


---

# Audio Enhancement

### A2SB (Bandwidth Extension and Inpainting)

**Audio-to-Audio Schrödinger Bridges**

A2SB is an audio restoration model tailored for high-resolution music at 44.1kHz. It is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, predicting waveform outputs without a vocoder, and can restore hour-long audio inputs. It achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.

---

### CleanUNet (Speech Denoising)

**CleanUNet: Speech Denoising in the Waveform Domain with Self-Attention**

**CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram**

CleanUNet is a causal speech denoising model operating on the raw waveform. CleanUNet 2 combines the advantages of waveform and spectrogram denoisers, achieving the best of both worlds.
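Causality is what allows a waveform denoiser to run in a streaming fashion: the output at time t may depend only on inputs up to t. A minimal sketch of the underlying idea, a 1-D convolution with left-only (causal) padding, is shown below; this is an illustration of the concept, not CleanUNet's actual architecture.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution with left-only zero padding, so the output at time t
    depends only on inputs at times <= t (required for streaming denoising)."""
    k = len(kernel)
    # Pad only on the left: no future samples leak into the output.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[i:i + k], kernel[::-1])
                     for i in range(len(x))])
```

For example, with kernel `[1, 1]` each output sample is `x[t] + x[t-1]`; a non-causal ("same") convolution would instead center the kernel and peek one sample into the future.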

---

# Text-to-Speech Models

### BigVGAN-v2

**A Universal Neural Vocoder with Large-Scale Training (ICLR 2023)**

BigVGAN-v2 is a widely used universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning. We release checkpoints for various configurations, including multiple sampling rates.

---

### P-Flow and A2-Flow

**P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting**

**A2-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching**

Please refer to the [Magpie-TTS API](https://build.nvidia.com/nvidia/magpie-tts-flow) for an interactive demo of NVIDIA's TTS models, which leverage techniques from these papers.

---

### DiffWave

**A Versatile Diffusion Model for Audio Synthesis (ICLR 2021 oral)**

DiffWave is the first diffusion model for raw waveform synthesis. It is a versatile waveform synthesis model for speech and non-speech generation.
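A diffusion waveform model like DiffWave generates audio by ancestral sampling: starting from Gaussian noise and iteratively denoising with a learned network. The sketch below shows the standard DDPM reverse process with a stand-in denoiser; the schedule values and the `denoiser`/`sample` names are illustrative assumptions, not the released DiffWave code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta noise schedule (values illustrative; DiffWave uses ~50 steps).
T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t):
    """Stand-in for the trained network eps_theta(x_t, t, conditioner).
    It returns zeros here; the real model predicts the noise added at step t."""
    return np.zeros_like(x_t)

def sample(length=16000):
    """Ancestral sampling: start from Gaussian noise, denoise step by step."""
    x = rng.standard_normal(length)
    for t in reversed(range(T)):
        eps = denoiser(x, t)
        # DDPM posterior mean: remove the predicted noise, then rescale.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise at every step except the last.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(length)
    return x

waveform = sample(1000)
```

With a trained conditional denoiser in place of the stand-in, the same loop turns a mel-spectrogram (or class label) into a raw waveform.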
---

### WaveGlow

**A Flow-based Generative Network for Speech Synthesis**

---

### Flowtron

**Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis**

---

### RAD-TTS and RAD-MMM

**RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis**

**RAD-MMM: Multilingual Multiaccented Multispeaker TTS with RADTTS**
# License

The code for different projects may be released under different licenses, including MIT, the NVIDIA OneWay Noncommercial License, the NVIDIA Source Code License, and others. Please refer to each project folder or its original GitHub repository for detailed license information.