Microsoft researchers have announced VALL-E, a text-to-speech AI model that can simulate a real person’s voice from just a three-second audio sample. It can then read out any text in that voice while preserving the speaker’s characteristic intonation, so the result sounds as if that person had actually said it. Its creators envision it being used for advanced text-to-speech and speech-editing applications, including in combination with other generative AI models such as GPT-3, which would generate the text.
Microsoft describes VALL-E as a neural codec language model. It builds on EnCodec, an audio compression neural network that Meta announced last year. Unlike other text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from the input text and the sample audio.
VALL-E essentially analyzes the characteristics of a given person’s speech and uses EnCodec to break that information into discrete components, or “tokens,” from which the final waveform is created. In addition to imitating the speaker’s voice and tone, it can also imitate the “acoustic environment” of the audio sample. For example, if the sample is taken from a phone call, the output reproduces the acoustics and frequency characteristics of a phone call.
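To make the idea of turning audio into discrete codes more concrete, here is a toy sketch in Python. It is not EnCodec or VALL-E: a simple uniform quantizer stands in for the learned neural codec, and the names `encode`, `decode`, and `NUM_LEVELS` are hypothetical, chosen only for illustration. It shows a waveform being converted into a sequence of integer tokens and then back into an approximate waveform.

```python
import math

NUM_LEVELS = 256  # hypothetical codebook size for this toy example


def encode(samples, num_levels=NUM_LEVELS):
    """Map samples in [-1.0, 1.0] to integer tokens in [0, num_levels - 1]."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        tokens.append(round((s + 1.0) / 2.0 * (num_levels - 1)))
    return tokens


def decode(tokens, num_levels=NUM_LEVELS):
    """Map integer tokens back to approximate samples in [-1.0, 1.0]."""
    return [t / (num_levels - 1) * 2.0 - 1.0 for t in tokens]


# A short 440 Hz sine wave sampled at 16 kHz stands in for recorded speech.
SAMPLE_RATE = 16_000
wave = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(160)]

tokens = encode(wave)       # discrete codes a language model could operate on
restored = decode(tokens)   # approximate waveform rebuilt from the codes
max_error = max(abs(a - b) for a, b in zip(wave, restored))
```

A real neural codec learns its codes from data and compresses far more aggressively than this, but the principle is the same: once audio is a sequence of tokens, a language model can predict it much as it would predict text.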
The Microsoft researchers trained the model on LibriLight, an audio library assembled by Meta that contains more than 60,000 hours of English speech from more than 7,000 speakers. For VALL-E to generate high-quality, realistic output, the voice in the audio sample must closely match a voice in the training data, so the researchers plan to expand the training set with additional data in the future.
Because of the potential for misuse, Microsoft is not making VALL-E or its code available for others to try at this time. According to its announcement, the company will follow its own guidelines for AI-related development in the future, and a separate model is planned to detect whether an audio clip was generated with VALL-E. On the project’s GitHub page, however, you can listen to samples of what the algorithm produces: it is not perfect yet, and some clips sound machine-made, but some of the results are frighteningly realistic.