AI can now create CD-quality music from text, and it’s only getting better

3D illustration of a toy robot singing.

Imagine typing “dramatic intro music” and hearing a soaring symphony, or typing “creepy footsteps” and getting high-quality sound effects. That’s the promise of Stable Audio, a text-to-audio AI model announced Wednesday by Stability AI that can synthesize music or sounds from written descriptions. Before long, similar technology may challenge musicians for their jobs.

If you recall, Stability AI is the company that helped fund the creation of Stable Diffusion, a latent diffusion image synthesis model released in August 2022. Not limited to image generation, the company also ventured into audio by backing Harmonai, an AI lab that launched the music generator Dance Diffusion in September 2022.

Now Stability and Harmonai want to break into commercial AI audio production with Stable Audio. Judging by the samples, it appears to be a significant audio quality upgrade from previous AI audio generators we’ve seen.

On its promotional page, Stability provides examples of audio generated by the model from prompts such as “Epic Trailer Music Intense Tribal Percussion & Brass” and “Lofi Hip Hop Beat Melodic Chillhop 85 BPM.” It also offers samples of sound effects generated with Stable Audio, such as an airline pilot talking over an intercom and people talking in a busy restaurant.

To train its model, Stability partnered with stock music provider AudioSparx and “licensed a data set consisting of over 800,000 audio files containing music, sound effects and single-instrument stems as well as associated textual metadata.” After feeding 19,500 hours of audio into the model, Stable Audio knows how to imitate certain sounds on command because those sounds have been associated with text descriptions of them in its neural network.
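To make that text-audio pairing concrete, here is a purely illustrative Python sketch of what one training record might look like. The field names and values are assumptions for illustration, not Stability's or AudioSparx's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class AudioTextPair:
    """One hypothetical training record: an audio file plus the
    text metadata the model learns to associate with its sound."""
    audio_path: str       # path to the licensed audio file
    description: str      # text metadata: genre, mood, instruments, tempo
    duration_secs: float  # length of the clip fed to the model

# Illustrative example; the metadata style mimics the prompts
# shown on Stable Audio's promotional page.
example = AudioTextPair(
    audio_path="stems/drums_tribal_percussion.wav",
    description="Epic trailer music, intense tribal percussion and brass",
    duration_secs=95.0,
)
```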

Block diagram of the Stable Audio architecture, provided by Stability AI.

Stable Audio consists of several parts that work together to create customized audio quickly. One part compresses the audio file in a way that keeps its important features while removing unnecessary noise, which makes the system faster to train and faster at generating new audio. Another part uses text (the metadata descriptions of the music and sounds) to guide what kind of audio is generated.
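As a rough illustration of how those parts could fit together, here is a minimal PyTorch sketch of a latent diffusion audio pipeline: an autoencoder compresses waveforms into a short latent sequence, a text embedding conditions the generation, and a denoiser turns noise into structured latents. Every module, shape, and name here is an assumption for illustration; Stability has not released Stable Audio's code.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Part 1: compress raw stereo audio into a short latent sequence
    (and decode it back), so the rest of the system handles far less data."""
    def __init__(self, channels=2, latent_dim=64, downsample=256):
        super().__init__()
        self.encoder = nn.Conv1d(channels, latent_dim,
                                 kernel_size=downsample, stride=downsample)
        self.decoder = nn.ConvTranspose1d(latent_dim, channels,
                                          kernel_size=downsample, stride=downsample)

    def encode(self, audio):   # (batch, 2, samples) -> (batch, 64, samples // 256)
        return self.encoder(audio)

    def decode(self, latent):  # map a latent sequence back to a waveform
        return self.decoder(latent)

class Denoiser(nn.Module):
    """Part 2: predict the noise in a latent, guided by a text embedding;
    applied over many steps, this turns pure noise into structured latents."""
    def __init__(self, latent_dim=64, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.net = nn.Conv1d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, text_embedding):
        cond = self.text_proj(text_embedding).unsqueeze(-1)  # broadcast over time
        return self.net(noisy_latent + cond)

# One illustrative denoising step: start from noise shaped like a
# compressed 95-second stereo clip and nudge it toward the prompt.
autoencoder, denoiser = AudioAutoencoder(), Denoiser()
samples = 95 * 44_100                               # 95 s at 44.1 kHz
latent = torch.randn(1, 64, samples // 256)         # noise in latent space
text_embedding = torch.randn(1, 512)                # stand-in for a text encoder
latent = latent - denoiser(latent, text_embedding)  # one of many steps
waveform = autoencoder.decode(latent)               # (1, 2, ~4.19 million samples)
```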

To speed things up, the Stable Audio architecture operates on a heavily simplified, compressed audio representation that reduces inference time (the time it takes for a machine learning model to produce an output after being given an input). According to Stability AI, Stable Audio can render 95 seconds of stereo audio at a 44.1 kHz sample rate (often called “CD quality”) in less than one second on an Nvidia A100 GPU. The A100 is a beefy data center GPU designed for AI use, and it is far more capable than a typical desktop gaming GPU.
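A quick back-of-the-envelope calculation shows why working in a compressed latent space matters: a 95-second stereo clip at 44.1 kHz is millions of raw sample values, and generating in a smaller latent space means the model never has to predict each one directly. The 64x compression factor below is an assumed figure for illustration, not a number Stability has published.

```python
SAMPLE_RATE = 44_100   # 44.1 kHz, "CD quality"
CHANNELS = 2           # stereo
DURATION_SECS = 95     # Stable Audio's maximum render length

raw_samples = SAMPLE_RATE * CHANNELS * DURATION_SECS
print(f"Raw sample values to produce: {raw_samples:,}")  # 8,379,000

# Assumed compression factor, purely for illustration.
COMPRESSION = 64
latent_values = raw_samples // COMPRESSION
print(f"Latent values the diffusion model works on: {latent_values:,}")  # 130,921
```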

As mentioned, Stable Audio is not the first music generator based on latent diffusion techniques. Last December, we covered Riffusion, a hobbyist audio spin on Stable Diffusion, though its generations were far from the quality of Stable Audio’s samples. In January, Google released MusicLM, an AI music generator that produces 24 kHz audio, and in August, Meta launched a suite of open source audio tools (including a text-to-music generator) called AudioCraft. Now, with 44.1 kHz stereo audio, Stable Audio is upping the ante.

Stability says Stable Audio will be available in a free tier and a $12 monthly Pro plan. The free option lets users generate up to 20 tracks per month, each with a maximum length of 20 seconds. The Pro plan expands those limits, allowing up to 500 track generations per month and track lengths of up to 90 seconds. Future Stability releases are expected to include open source models based on the Stable Audio architecture, as well as training code for those interested in developing audio generation models.

As it stands, judging by its audio fidelity, it looks like we could be on the verge of production-quality AI-generated music with Stable Audio. Will musicians be happy if they are replaced by AI models? If the protests over AI in the visual arts have taught us anything, perhaps not. For now, a human can easily surpass anything an AI can create, but that may not be true for long. Either way, AI-generated audio may become another tool in the professional audio production toolbox.
