New AI Tech Can Mimic Any Voice – Scientific American

Posted: May 2, 2017 at 11:03 pm

Even the most natural-sounding computerized voices, whether it's Apple's Siri or Amazon's Alexa, still sound like, well, computers. Montreal-based start-up Lyrebird is looking to change that with an artificially intelligent system that learns to mimic a person's voice by analyzing speech recordings and the corresponding text transcripts, and by identifying the relationships between them. Introduced last week, Lyrebird's speech synthesis can generate thousands of sentences per second, significantly faster than existing methods, and mimic just about any voice, an advancement that raises ethical questions about how the technology might be used and misused.

The ability to generate natural-sounding speech has long been a core challenge for computer programs that transform text into spoken words. Artificial intelligence (AI) personal assistants such as Siri, Alexa, Microsoft's Cortana and the Google Assistant all use text-to-speech software to create a more convenient interface with their users. Those systems work by cobbling together words and phrases from prerecorded files of one particular voice. Switching to a different voice, such as having Alexa sound like a man, requires a new audio file containing every possible word the device might need to communicate with users.

Lyrebird's system can learn the pronunciations of characters, phonemes and words in any voice by listening to hours of spoken audio. From there it can extrapolate to generate completely new sentences and even add different intonations and emotions. Key to Lyrebird's approach are artificial neural networks, which use algorithms designed to help them function like a human brain, and which rely on deep-learning techniques to transform bits of sound into speech. A neural network takes in data and learns patterns by strengthening connections between layered neuronlike units.

After learning how to generate speech, the system can then adapt to any voice based on only a one-minute sample of someone's speech. "Different voices share a lot of information," says Lyrebird co-founder Alexandre de Brébisson, a PhD student at the Montreal Institute for Learning Algorithms laboratory at the University of Montreal. "After having learned several speakers' voices, learning a whole new speaker's voice is much faster. That's why we don't need so much data to learn a completely new voice. More data will still definitely help, yet one minute is enough to capture a lot of the voice 'DNA.'"

Lyrebird showcased its system using the voices of U.S. political figures Donald Trump, Barack Obama and Hillary Clinton in a synthesized conversation about the start-up itself. The company plans to sell the system to developers for use in a wide range of applications, including personal AI assistants, audio book narration and speech synthesis for people with disabilities.

Last year Google-owned company DeepMind revealed its own speech-synthesis system, called WaveNet, which learns from listening to hours of raw audio to generate sound waves similar to a human voice. It then can read a text out loud with a humanlike voice. Both Lyrebird and WaveNet use deep learning, but the underlying models are different, de Brébisson says. "Lyrebird is significantly faster than WaveNet at generation time," he says. "We can generate thousands of sentences in one second, which is crucial for real-time applications. Lyrebird also adds the possibility of copying a voice very fast and is language-agnostic." Scientific American reached out to DeepMind but was told WaveNet team members were not available for comment.

Lyrebird's speed comes with a trade-off, however. Timo Baumann, a researcher who works on speech processing at the Language Technologies Institute at Carnegie Mellon University and is not involved in the start-up, noted Lyrebird's generated voice carries a buzzing noise and a faint but noticeable robotic sheen. Moreover, it does not generate breathing or mouth-movement sounds, which are common in natural speaking. "Sounds like lip smack and inbreathe are important in conversation. They actually carry meaning and are observable to the listener," Baumann says. These flaws make it possible to distinguish the computer-generated speech from genuine speech, he notes. "We still have a few years before technology can get to a point that it could copy a voice convincingly in real time," he adds.

Still, to untrained ears and unsuspecting minds, an AI-generated audio clip could seem genuine, creating ethical and security concerns about impersonation. Such a technology might also confuse and undermine voice-based verification systems. Another concern is that it could render unusable voice and video recordings used as evidence in court. A technology that can be used to quickly manipulate audio will even call into question the veracity of real-time video in live streams. And in an era of fake news it can only compound existing problems with identifying sources of information. "It will probably be still possible to find out when audio has been tampered with," Baumann says, "but I'm not saying that everybody will check."

Systems equipped with a humanlike voice may also pose less obvious but equally problematic risks. For example, users may trust these systems more than they should, giving out personal information or accepting purchasing advice from a device, treating it like a friend rather than a product that belongs to a company and serves its interests. "Compared to text, voice is just much more natural and intimate to us," Baumann says.

Lyrebird acknowledges these concerns and essentially issues a warning in the brief ethics statement on the company's Web site. Lyrebird cautions the public that the software could be used to manipulate audio recordings used as evidence in court or to assume someone else's identity. "We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible," according to the site.

Just as people have learned photographs cannot be fully trusted in the age of Photoshop, they may need to get used to the idea that speech can be faked. There is currently no way to prevent the technology from being used to make fraudulent audio, says Bruce Schneier, a security technologist and lecturer in public policy at the Kennedy School of Government at Harvard University. The risk of encountering a fake audio clip has now become the new reality, he says.
