Maeve Flack is a spunky, spirited 8-year-old. She also has cerebral palsy and communicates with a speech synthesizer, using her eye gaze to control what she says.
Her mother, Kara, is thrilled her daughter can express herself. But the voice that Maeve’s synthesizer produces is that of a computerized adult woman.
That same generic voice is likely used by thousands of other people. Choices are limited when it comes to voice synthesizers for those with speech impairments. But Dr. Rupal Patel is working to change that.
Patel, a professor in the departments of Communication Sciences and Disorders and Computer and Information Science at Northeastern University in Boston, is developing algorithms that build personalized voices for those unable to speak. As founder and director of the Communication Analysis and Design Laboratory (CadLab), she helped create a technology that combines a database of donor recordings with a sample of a patient’s own speech, however limited it may be. The project is called VocaliD, and it’s providing people who can’t speak with a voice all their own — natural-sounding and tailored to their unique vocal identities.
Patel says our voices are a unique and fundamental aspect of who we are. But for people with speech disorders or neuromuscular impairments such as cerebral palsy, Parkinson’s disease, or ALS, speech may not be a viable option for everyday communication.
According to Patel, there are approximately 2.5 million people in the United States who use assistive devices to communicate. This technology allows them to select words that the device then speaks out loud, but it’s still limited in the number and kinds of voices available. Renowned physicist Stephen Hawking uses a speech synthesizer with one of the most popular voices, an American-accented male voice he shares with thousands of other people.
Some of the major companies that make assistive devices for those who cannot speak are Dynavox, AssistiveWare, and Prentke Romich. Russell Cross, the Director of Clinical Applications for Prentke Romich, says the company is focused on developing voice output technologies and different ways of accessing them. “Some people can press buttons, some people have to use switches, and some people are using eye gaze systems,” he says. Prentke Romich offers everything from a $150 app for use with iPads to eye gaze systems that cost $12,000.
Cross says they are always looking to make their technology smaller, lighter, cheaper, and more accessible. “We’re working on improvements in how people can access the technology, the best example of which has been the development of eye gaze systems,” he says. And while Prentke Romich, like other assistive device companies, doesn’t design the voices its devices use, instead sourcing them from other companies, Cross says they’re always looking to see what new voices are available to integrate into their devices.
Patel began thinking about this lack of individuation in synthetic voices at a conference for the makers and users of prosthetic voices in 2002. She saw a little girl and a grown man having a conversation with their speech synthesizer devices.
“They were using the same voice,” she recalls. “And as I turned around, I noticed that same voice was coming at me from everywhere.”
Patel incubated the idea of making synthetic voices more personalized for several years. We wouldn’t give a little girl a grown man’s prosthetic limb, she thought. So why give her a grown man’s voice?
From her research, Patel had demonstrated that even people unable to speak could still control aspects of their voice. She wanted to find out what qualities she could harness from these residual vocal abilities and how she could use them to create a customized technology.
Speech is a combination of two signals. First comes the source: air from the lungs sets the vocal folds vibrating, producing a raw, buzzing tone. That sound is then shaped as it passes through the rest of the vocal tract (the throat, mouth, and nose), which acts as the filter. It’s the combination of the source and the filter that results in speech.
“The speech sounds, consonants and vowels, are all due to the filter,” says Patel. “We form our mouths and tongues to shape the air that’s created by our voice box to make these specific sounds.”
In individuals with very limited speech production, the filter is impaired, but they can still control many aspects of the source: pitch, loudness, breathiness, and tempo. These qualities are collectively known as prosody.
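The source-filter idea can be sketched in a few lines of code. This toy model (not VocaliD’s software, and with made-up parameter values) drives a simple two-pole resonator, standing in for a single vocal-tract resonance, with a pulse train standing in for the vibrating vocal folds. The same filter with a different source pitch produces the same “vowel” in a different voice:

```python
import math

SR = 16000  # sample rate in Hz (illustrative choice)

def glottal_source(f0, seconds, sr=SR):
    """The 'source': a periodic pulse train at pitch f0, a crude stand-in
    for the vibrating vocal folds."""
    period = int(sr / f0)
    n = int(seconds * sr)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq, bandwidth, sr=SR):
    """The 'filter': a two-pole resonator, a toy stand-in for one
    vocal-tract resonance that shapes the source into a vowel-like sound."""
    r = math.exp(-math.pi * bandwidth / sr)
    theta = 2 * math.pi * freq / sr
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # feed each pulse through the resonance
        out.append(y)
        y1, y2 = y, y1
    return out

# Same filter (same "vowel"), two different pitches (two different speakers):
child = resonator(glottal_source(f0=300, seconds=0.1), freq=700, bandwidth=100)
adult = resonator(glottal_source(f0=110, seconds=0.1), freq=700, bandwidth=100)
```

Changing the source pitch changes who the voice sounds like; changing the resonator changes what is being said. This is the asymmetry Patel exploits: identity-bearing source cues can survive even when articulation is impaired.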
Patel can take the sounds people with speech disorders can produce, such as a sustained vowel sound, and extract prosodic characteristics of their voice.
“There’s a lot that’s happening before the air stream even gets shaped into consonants and vowels, and that is what we want to capture, because there is vocal identity information in that aspect of the signal,” says Patel.
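Extracting prosodic cues from something as minimal as a sustained vowel can be illustrated with elementary signal processing. The sketch below is a toy, not Patel’s analysis code: it estimates pitch by autocorrelation (finding the lag at which the signal best matches a shifted copy of itself) and loudness by root-mean-square amplitude, using a synthetic tone as a stand-in for a recorded vowel:

```python
import math

SR = 8000  # sample rate in Hz (illustrative choice)

def estimate_pitch(samples, sr=SR, fmin=80, fmax=400):
    """Estimate fundamental frequency by autocorrelation: the lag with the
    highest self-similarity corresponds to one pitch period."""
    best_lag, best_score = 0, 0.0
    for lag in range(sr // fmax, sr // fmin + 1):
        score = sum(samples[i] * samples[i - lag] for i in range(lag, len(samples)))
        if score > best_score:
            best_lag, best_score = lag, score
    return sr / best_lag if best_lag else 0.0

def estimate_loudness(samples):
    """Root-mean-square amplitude, a simple proxy for loudness."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# A stand-in for a half-second recording of a sustained vowel: a 220 Hz tone.
vowel = [0.5 * math.sin(2 * math.pi * 220 * t / SR) for t in range(SR // 2)]
pitch = estimate_pitch(vowel)    # close to 220 Hz (integer-lag resolution)
loud = estimate_loudness(vowel)
```

Real prosody analysis also tracks how these values move over time, and measures qualities like breathiness, but the principle is the same: these cues live in the source signal, before any consonants or vowels are formed.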
This insight — that there is something preserved in the voices of people who can’t speak that can be harnessed, and that those same cues are important for speaker identification — led Patel to the idea for VocaliD.
“We had all these voice samples from people who couldn’t speak but could still make vocalizations,” Patel says. “Why don’t we combine that with a voice donor who can create lots of speech?”
The idea is to take the source from the person who cannot speak and the filter from a voice donor who can articulate speech and blend the two together. When they’re mixed, the result is a personalized prosthetic voice that is as clear as the voice donor and similar in prosody to the voice recipient.
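As a rough illustration of the blend step, one could resample a donor’s waveform so that its pitch matches the recipient’s. Real voice transformation is far more sophisticated (it preserves duration and vowel quality, which this toy version does not); the sketch is only meant to show the direction of the operation, with illustrative numbers throughout:

```python
import math

SR = 16000  # sample rate in Hz (illustrative choice)

def shift_pitch(samples, src_f0, target_f0):
    """Crudely move a voice's pitch by resampling: reading the waveform
    faster raises the pitch, reading it slower lowers it. A real system
    would keep duration and formants intact; this toy does not."""
    ratio = target_f0 / src_f0
    n = int(len(samples) / ratio)
    return [samples[int(i * ratio)] for i in range(n)]

# A stand-in donor vowel at 110 Hz, shifted toward a recipient's 220 Hz pitch.
donor = [math.sin(2 * math.pi * 110 * t / SR) for t in range(SR)]
blended = shift_pitch(donor, src_f0=110, target_f0=220)
```

Doubling the pitch here halves the playback length, which is exactly the artifact real voice transformation techniques are designed to avoid while still imposing the recipient’s prosody on the donor’s articulation.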
To build the first set of prosthetic voices, Patel and her collaborator, Tim Bunnell, had voice donors come to a studio and record about three to four hours of speech. Donors went through a list of thousands of utterances covering all the different combinations of sounds in the English language.
In the lab, these recordings were broken down into little snippets of speech that could be recombined into entirely new sentences the donor never said. The method is called concatenative synthesis.
Alan Black, a computer science professor at Carnegie Mellon University in Pennsylvania, helped develop concatenative synthesis in the 1980s. “It changed the quality of speech synthesis from being understandable but robotic to potentially being as good as prerecorded speech,” he says. “The basic idea is to take small pieces of carefully recorded speech, join them together, and modify the pitch and duration to make it sound more natural.”
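In miniature, concatenative synthesis looks like the sketch below: stored snippets are joined end to end, with a short crossfade at each seam to avoid audible clicks. The snippet contents here are synthetic tones standing in for labeled pieces of donor speech; real systems also adjust pitch and duration at each join, which this toy omits:

```python
import math

def crossfade_join(snippets, fade=32):
    """Join recorded snippets end to end, blending `fade` samples at each
    seam with a linear crossfade so the waveform has no abrupt jumps."""
    out = list(snippets[0])
    for snip in snippets[1:]:
        for i in range(fade):
            w = i / fade
            out[-fade + i] = (1 - w) * out[-fade + i] + w * snip[i]
        out.extend(snip[fade:])
    return out

def tone(freq, n=400, sr=8000):
    """A synthetic stand-in for one stored speech snippet."""
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]

# Three "units" joined into one novel utterance the donor never recorded:
word = crossfade_join([tone(200), tone(300), tone(250)])
```

The hard part of a real system is not the joining itself but selecting which snippets, out of hours of recordings, fit together most naturally.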
Patel and Bunnell also used concatenative synthesis, drawing on their donor database, which currently holds recordings from over 1,700 donors. Then, from the voice recipient, they determined qualities like pitch, loudness, and breathiness from whatever sounds he or she could produce. They then blended the recipient’s recordings with those of a donor of similar age and gender using voice transformation techniques. The result is a brand-new voice, reverse-engineered to approximate what the recipient might sound like if he or she could talk: a voice with the vocal identity of the recipient and the vocal clarity of the donor.
So far, these personalized voices aren’t available to everyone. In the last three years, Patel’s research team has built voices for three young women, now 11, 13, and 19 years old. It took 10-15 hours to create each voice, requiring a combination of a computational algorithm and some manual tweaking to get them just right.
Patel expects that recipients who receive prosthetic voices as children or teens will come in to update their voices as they mature. The recipient would have to provide a new voice sample, but Patel hasn’t worked out yet if an entirely new donor would also be needed, or if there is a way to computationally “grow” a younger donated voice into a more mature one.
The early results of VocaliD are exciting. “I think the goal is excellent,” says Jan van Santen, a professor who works on speech synthesis at Oregon Health and Science University. “It deals with a very difficult situation, namely where the only thing a patient can still do is just say a vowel sound. To extract any individual character out of that kind of minimal recording is very difficult.”
In an effort to transition the technology from lab to venture, Patel has founded VocaliD Inc. and launched the Human Voicebank Initiative, a project to collect a million donor voices. There are currently over 24,000 people signed up who want to donate their voices.
With so much interest, Patel is working hard to build up VocaliD’s tech infrastructure. The blending of voices is currently a labor-intensive job, so Patel is exploring ways to make it more automated. She has developed a web client and is working on an iPhone app that will allow donors to record in their own time, without having to come into the lab.
Patel is also working on refining the process for matching voice donors and recipients. They originally matched them for age and gender, but now they’re thinking of including even more variables, like a person’s size and where he or she has lived. Patel compares it to finding an organ donor: the better the match, the more successful the result. “The more similar the donor and recipient, the less digital distortion there will be when you mix those voices together and the more likely it will be that you’ll get an authentic sounding voice,” she says.
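The matching idea can be sketched as a simple scoring function over candidate donors. The fields and weights below are illustrative guesses, not VocaliD’s actual criteria:

```python
def match_score(donor, recipient):
    """Score how well a donor matches a recipient. The variables (gender,
    age, region) come from the factors described above; the weights are
    hypothetical."""
    score = 0
    if donor["gender"] == recipient["gender"]:
        score += 3
    score += max(0, 2 - abs(donor["age"] - recipient["age"]) // 5)
    if donor["region"] == recipient["region"]:
        score += 1
    return score

recipient = {"gender": "F", "age": 8, "region": "New England"}
donors = [
    {"name": "A", "gender": "F", "age": 10, "region": "New England"},
    {"name": "B", "gender": "M", "age": 9,  "region": "New England"},
    {"name": "C", "gender": "F", "age": 35, "region": "Midwest"},
]
best = max(donors, key=lambda d: match_score(d, recipient))
```

With a million-voice bank, even a coarse scoring pass like this would narrow the field before the more expensive acoustic comparisons begin.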
Patel continues, “One goal is getting this technology into more people’s hands. But a bigger mission is communicating the importance of having a personalized voice.” She says the aesthetics of personalization are treated as if they’re not important, but they have huge social ramifications.
“There are lots of people who have a device to communicate, but it’s not enough,” Patel says. “Having a personalized voice is actually key to having them own the technology and feel comfortable using it.”
That is certainly the case for Maeve Flack. Her family, who lives in Boston, heard about Patel’s work when a friend forwarded them the link to a TED talk she gave in 2013. Soon after, Maeve’s 10-year-old sister, Erin, emailed Patel and volunteered to donate her voice to her little sister.
Erin and her family are eager to start the process of building a personalized voice for Maeve. The girls’ mother, Kara, says people generally take for granted the differentiation between their voices.
“It’s wonderful for her to have a voice, but to have her own voice would just be so special,” she says. “Now, we can run into somebody else who has the same device as Maeve and their voices can be exactly the same.
“I love this technology,” Kara continues. “We’re thrilled to have this option for Maeve, and look forward to hearing her talk in a voice that reflects her unique personality.”