NVIDIA recently announced a new experimental generative AI model, which it refers to as “a Swiss Army knife for sound.” The Foundational Generative Audio Transformer Opus 1 model, also referred to as Fugatto, accepts text inputs and uses them to create audio or modify existing music, speech, and sound files. It was developed by a global team of AI professionals, which NVIDIA claims increased the model’s “multi-accent and multilingual capabilities.”
“We wanted to create a model that understands and generates sound like humans do,” explained Rafael Valle, one of the project’s researchers and NVIDIA’s manager of applied audio research. In its announcement, the company outlined several potential real-world applications of Fugatto. It claimed that music producers could utilize the technology to quickly create a prototype for a song idea, which they could subsequently edit to experiment with different styles, voices, and instruments.
Fugatto could be used by an advertising agency to swiftly tailor an existing campaign for numerous regions or situations, using different accents and emotions in voiceovers. Language learning tools could be customized to use whichever voice the learner prefers; for instance, an online course could be delivered in the voice of a family member or friend. Video game creators could use the model to adjust prerecorded material to match the changing action as users play. Fugatto can also produce unconventional sounds, such as a trumpet that barks or a saxophone that meows, showcasing its creative versatility: whatever the user describes, the model attempts to build. Researchers also discovered that, with some fine-tuning, it could handle tasks for which it had not been pre-trained.
Fugatto’s novelty comes from a variety of capabilities. The model employs a technique called ComposableART during inference, which allows it to combine instructions that were only seen separately during training. Its ability to interpolate between instructions gives users fine-grained control over text-specified attributes such as accent strength or emotional tone. AI researcher Rohan Badlani designed this capability to let users creatively combine and adjust those attributes. Badlani, who holds a master’s degree in AI from Stanford, said the results surprised him and made him feel more like an artist, despite his computer science background.
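NVIDIA has not published the internals of ComposableART, but the idea of interpolating between instructions can be illustrated with a minimal sketch: blend two conditioning embeddings with a weight that acts as a dial for an attribute such as accent strength. The function name, the toy vectors, and the linear-blend formula below are all assumptions for illustration, not Fugatto’s actual implementation.

```python
import numpy as np

def interpolate_conditions(emb_a, emb_b, weight):
    """Linearly blend two conditioning embeddings.

    weight=0.0 returns emb_a, weight=1.0 returns emb_b; values in
    between mix the two attributes (e.g. neutral vs. strong accent).
    Hypothetical sketch -- real systems may combine instructions
    in far more sophisticated ways.
    """
    weight = float(np.clip(weight, 0.0, 1.0))
    return (1.0 - weight) * emb_a + weight * emb_b

# Toy embeddings standing in for "neutral accent" and "strong accent".
neutral = np.array([1.0, 0.0, 0.0])
strong = np.array([0.0, 1.0, 0.0])

# A 50/50 blend of the two attributes.
halfway = interpolate_conditions(neutral, strong, 0.5)
```

The appeal of this kind of control is that the weight is continuous, so a user can nudge an attribute slightly rather than toggling it on or off.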
Fugatto can also generate sounds that change over time, a feature known as temporal interpolation. For example, it can create a rainstorm with thunder that gradually fades into the distance, or a thunderstorm transitioning to a serene dawn with birdsong. Unlike most models, which can only replicate the training data they have been exposed to, Fugatto enables users to generate entirely new sounds.
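The "thunder fading into the distance" example amounts to scheduling an attribute to change across the clip. A minimal sketch of that idea is a per-sample gain curve that ramps from full volume to silence; the function and parameters below are hypothetical illustrations, not Fugatto’s actual mechanism.

```python
import numpy as np

def fading_gain(duration_s, sample_rate, start_gain=1.0, end_gain=0.0):
    """Per-sample gain curve fading linearly from start_gain to end_gain.

    Sketches how an attribute (e.g. thunder loudness) could be
    scheduled to evolve over the length of a generated clip.
    Hypothetical illustration only.
    """
    n_samples = int(duration_s * sample_rate)
    return np.linspace(start_gain, end_gain, n_samples)

# A 2-second fade-out at an 8 kHz sample rate, applied to a sound by
# elementwise multiplication: audio * gain.
gain = fading_gain(2.0, 8000)
```

In a generative model the same idea would apply to conditioning signals rather than raw samples, but the principle of a time-varying control curve is the same.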
NVIDIA has not announced whether Fugatto will be made available to the public, but the model is not the first generative AI tool capable of producing sounds from text prompts. Meta has already released an open-source AI kit that can generate sounds from text descriptions, and Google has its own text-to-music AI, MusicLM, which is accessible via the company’s AI Test Kitchen website.