Meta Releases Open-Source Generative AI Text-to-Sound Engine AudioCraft
Meta’s Fundamental AI Research (FAIR) team has unveiled a new generative AI music and sound framework named AudioCraft. The framework can transform a text prompt into virtually any kind of sound by melding the text-to-music model MusicGen with the text-to-sound AI tool AudioGen, both underpinned by EnCodec, a neural audio codec that compresses raw audio into the discrete tokens the models are trained on.
Meta first showcased MusicGen a couple of months ago, demonstrating how it could translate a written prompt into music, though the samples ran only about 12 seconds. The text prompt can be supplemented with an audio clip that serves as a reference for the AI to build on. AudioGen does similar work but emphasizes realistic environmental sound. Underpinning both is EnCodec, which processes raw audio into audio tokens, establishing what Meta calls a “fixed vocabulary” on which language models can be trained to generate new sounds, whether a natural backdrop or a musical score. The result, AudioCraft, simplifies the process relative to Meta’s earlier audio projects.
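The tokenization idea behind EnCodec can be illustrated with a toy sketch: continuous audio values are snapped to the nearest entry in a small codebook, producing discrete token IDs that a language model could learn to predict, and the tokens can later be mapped back to an approximate signal. Everything here, from the four-entry codebook to the helper names, is hypothetical and vastly simplified; the real EnCodec applies learned residual vector quantization to neural-network features rather than raw sample values.

```python
def tokenize(frames, codebook):
    """Map each audio frame (a float) to the index of the nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - frame))
        for frame in frames
    ]

def detokenize(tokens, codebook):
    """Reconstruct an approximate signal by looking tokens back up in the codebook."""
    return [codebook[t] for t in tokens]

# A hypothetical 4-entry "fixed vocabulary" and a short signal.
codebook = [-0.75, -0.25, 0.25, 0.75]
signal = [0.9, 0.1, -0.4, -0.8, 0.3]

tokens = tokenize(signal, codebook)      # discrete IDs: [3, 2, 1, 0, 2]
approx = detokenize(tokens, codebook)    # lossy reconstruction from tokens
```

A language model trained on sequences of such IDs can generate new token sequences, which a decoder then turns back into audio.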
“AudioCraft works for music and sound generation and compression — all in the same place. Because it’s easy to build on and reuse, people who want to build better sound generators, compression algorithms, or music generators can do it all in the same code base and build on top of what others have done,” Meta explained in a blog post. “And while a lot of work went into making the models simple, the team was equally committed to ensuring that AudioCraft could support the state of the art. People can easily extend our models and adapt them to their use cases for research. There are nearly limitless possibilities once you give people access to the models to tune them to their needs. And that’s what we want to do with this family of models: give people the power to extend their work.”
Making AudioCraft open source gives developers the flexibility to experiment with and extend the framework, which is likely to attract plenty of users beyond those interested in synthesizing audio tracks. It could also give Meta a leg up on the competition, echoing its strategy of releasing the new Llama 2 large language model without requiring a business license fee. By comparison, Google’s MusicLM generative AI music composer has appeared only in limited demonstrations and has yet to make much of a splash with the public beyond that initial burst of attention.
Not that AudioCraft is unique as a synthetic sonic composer. Generative AI already powers tools like Riffusion, which uses Stable Diffusion to turn a text prompt into a spectrogram image and then uses Torchaudio to convert its frequency and time information back into playable sound. Voicemod’s synthetic song generator, which matches submitted lyrics to a selection of popular songs and AI voices, and the text-centered LyricStudio, which claims its AI has assisted in writing more than a million songs, also contribute to the overall symphony.

You can hear a few examples of AudioCraft in the tracks below.
Prompt: Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach
Prompt: Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves
Prompt: Sirens and a humming engine approach and pass