A Groundbreaking New AI Taught Itself to Speak in Just a Few Hours – Futurism

Posted: March 11, 2017 at 7:40 am

Giving Machines a Voice

Last year, Google successfully gave a machine the ability to generate human-like speech through its voice synthesis program, WaveNet. Powered by Google's DeepMind artificial intelligence (AI) deep neural network, WaveNet produced synthetic speech from given text. Now, Chinese internet search company Baidu has developed the most advanced speech synthesis program yet, and it's called Deep Voice.

Developed in Baidu's AI research lab in Silicon Valley, Deep Voice represents a major breakthrough in speech synthesis technology, largely doing away with the behind-the-scenes fine-tuning such programs typically require. As a result, Deep Voice can learn how to talk in a matter of hours and with virtually no help from humans.

Deep Voice uses a relatively simple method. Through deep-learning techniques, it breaks text down into phonemes, the smallest perceptually distinct units of sound. A speech synthesis network then reproduces those sounds. Because every stage of the process relies on deep learning, the need for fine-tuning is greatly reduced; all the researchers needed to do was train the algorithm.
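
To make that data flow concrete, here is a toy Python sketch of the two-stage pipeline, text to phonemes to waveform. It is not Baidu's code: the dictionary lookup and sine tones below are stand-ins for the neural networks Deep Voice uses at each stage, and every name in it is invented for illustration.

    # Toy pipeline: text -> phonemes -> waveform. In Deep Voice both
    # stages are learned neural networks; here they are placeholders
    # (a lookup table and sine tones) just to show the data flow.
    import numpy as np

    SAMPLE_RATE = 16_000

    # Stage 1 stand-in: grapheme-to-phoneme conversion via a tiny dictionary.
    PHONEME_DICT = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def text_to_phonemes(text):
        phonemes = []
        for word in text.lower().split():
            phonemes.extend(PHONEME_DICT.get(word, []))
        return phonemes

    # Stage 2 stand-in: render each phoneme as a short tone; the real
    # system generates the waveform with a neural vocoder instead.
    def phonemes_to_audio(phonemes, dur=0.1):
        t = np.linspace(0.0, dur, int(SAMPLE_RATE * dur), endpoint=False)
        chunks = []
        for p in phonemes:
            freq = 200.0 + 20.0 * (sum(map(ord, p)) % 12)  # arbitrary pitch
            chunks.append(0.5 * np.sin(2 * np.pi * freq * t))
        return np.concatenate(chunks) if chunks else np.zeros(0)

    audio = phonemes_to_audio(text_to_phonemes("hello world"))
    print(audio.shape)  # (12800,): 8 phonemes x 0.1 s at 16 kHz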

"For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original," the Baidu researchers wrote in a study published online. "By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise."
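
For readers curious what a "variant of WaveNet" is built from, below is a minimal PyTorch sketch of WaveNet's core ingredient: a stack of dilated causal 1-D convolutions whose receptive field doubles with each layer. The channel and layer counts are illustrative assumptions, not the parameters Baidu used.

    # Minimal dilated-causal-convolution stack, the building block of
    # WaveNet-style audio models. Sizes here are arbitrary examples.
    import torch
    import torch.nn as nn

    class DilatedCausalStack(nn.Module):
        def __init__(self, channels=32, layers=6):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
                for i in range(layers)  # dilations 1, 2, 4, 8, ...
            )

        def forward(self, x):
            # x: (batch, channels, time). Left-pad before each conv so no
            # output ever depends on future samples (causality).
            for conv in self.convs:
                pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
                x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
            return x

    x = torch.randn(1, 32, 1000)
    print(DilatedCausalStack()(x).shape)  # torch.Size([1, 32, 1000])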

Text-to-speech systems aren't entirely new. They're present in many of the world's modern gadgets and devices, from simpler ones like talking clocks and phone answering systems to more complex versions, like those in navigation apps. These, however, have been built from large databases of speech recordings, so the speech generated by traditional text-to-speech systems doesn't flow as seamlessly as actual human speech.
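
For contrast, here is a bare-bones sketch of that traditional concatenative approach: pre-recorded units are looked up and butt-joined, and the audible seams at the joins are one reason such systems sound less fluid than generated speech. The "recordings" are noise placeholders, purely for illustration.

    # Toy concatenative synthesis: stitch pre-recorded clips together.
    import numpy as np

    rng = np.random.default_rng(0)
    # Pretend unit database; a real one holds hours of recorded speech.
    RECORDINGS = {w: rng.standard_normal(1600) for w in ("hello", "world")}

    def concatenative_tts(text):
        # Butt-join the clips; real systems search a huge database and
        # smooth the joins, but boundary artifacts still creep in.
        units = [RECORDINGS[w] for w in text.lower().split() if w in RECORDINGS]
        return np.concatenate(units) if units else np.zeros(0)

    print(concatenative_tts("hello world").shape)  # (3200,)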

Baidu's work on Deep Voice is a step toward achieving human-like speech synthesis in real time, without relying on pre-recorded responses. Deep Voice strings phonemes together in a way that sounds like actual human speech. "We optimize inference to faster-than-real-time speeds, showing that these techniques can be applied to generate audio in real-time in a streaming fashion," the researchers said.
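
Operationally, "faster than real time, in a streaming fashion" means each chunk of audio takes less wall-clock time to generate than it does to play, so playback can begin before the utterance is finished. A rough sketch, with a hypothetical generate_chunk standing in for the synthesis network:

    # Streaming generation loop with a real-time-factor check.
    import time
    import numpy as np

    SAMPLE_RATE = 16_000
    CHUNK = 1_600  # 0.1 s of audio per step

    def generate_chunk(step):
        # Stand-in for one inference step of the neural vocoder.
        t = (np.arange(CHUNK) + step * CHUNK) / SAMPLE_RATE
        return np.sin(2 * np.pi * 220.0 * t).astype(np.float32)

    start = time.perf_counter()
    for step in range(10):  # 1 s of audio in total
        chunk = generate_chunk(step)
        # ... hand `chunk` to the audio device here ...
    elapsed = time.perf_counter() - start
    print(f"real-time factor: {elapsed / 1.0:.3f}")  # < 1 means faster than real time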

However, there are still certain variables that the new system cannot yet control: the stress placed on phonemes, and the duration and fundamental frequency of each sound. Once perfected, control over these variables would allow Baidu to change the voice of the speaker and, possibly, the emotions conveyed by a word.
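
One way to picture those variables is as explicit per-phoneme inputs to the synthesizer. The dataclass below is purely illustrative, not Deep Voice's actual interface; it just shows how stress, duration, and fundamental frequency could be exposed and then tweaked to alter a voice:

    # Hypothetical per-phoneme controls: stress, duration, and pitch (F0).
    from dataclasses import dataclass

    @dataclass
    class PhonemeSpec:
        symbol: str        # e.g. "AH"
        stressed: bool     # lexical stress on this phoneme
        duration_s: float  # how long the sound is held
        f0_hz: float       # fundamental frequency (pitch)

    neutral = [PhonemeSpec("HH", False, 0.08, 120.0),
               PhonemeSpec("AH", True, 0.12, 125.0)]

    # Stretching durations and raising F0 is one crude way an "excited"
    # rendition of the same phonemes might be encoded.
    excited = [PhonemeSpec(p.symbol, p.stressed, p.duration_s * 1.2,
                           p.f0_hz * 1.3) for p in neutral]
    print(excited[1].f0_hz)  # 162.5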

At the very least, this would be computationally demanding, limiting just how much Deep Voice can be used for real-time speech synthesis in the real world.

In the future, better speech synthesis systems could be used to improve the assistant features found in smartphones and smart home devices. At the very least, they would make talking to your devices feel more real.
