Like Henry Higgins, the phonetician from George Bernard Shaw’s play “Pygmalion,” Marius Cotescu and Georgy Tinchev recently demonstrated how their student was struggling to overcome pronunciation difficulties.
The two data scientists, who work for Amazon in Europe, were teaching Alexa, the company’s digital assistant. Their task: helping Alexa master Irish-accented English with the help of artificial intelligence and recordings from native speakers.
During the demonstration, Alexa talked about a memorable night out. “Last night’s party was great craic,” Alexa said, using the Irish word for fun. “On the way home we got ice cream, and we were happy.”
Mr. Tinchev shook his head. Alexa had dropped the “r” in “party,” making the word sound flat, like “pah-tee.” Too British, he concluded.
The technologists are part of a team at Amazon working on a challenging area of data science known as voice disentanglement. It is a thorny problem that has taken on new relevance amid a wave of AI developments, with researchers believing that solving such speech puzzles could help make AI-powered devices, bots and speech synthesizers more conversational; that is, capable of speaking in the many regional accents of the people who use them.
Voice disentanglement involves much more than simply understanding vocabulary and syntax. A speaker’s pitch, timing and accent often give words subtle meaning and emotional weight. Linguists call this feature of language “prosody,” and machines have had a hard time mastering it.
Only in recent years, thanks to advances in AI, computer chips and other hardware, have researchers made progress in solving voice disentanglement, turning computer-generated speech into something more pleasing to the ear.
Such efforts may eventually dovetail with the explosion of “generative AI,” a technology that enables chatbots to generate their own responses. Chatbots like ChatGPT and Bard could someday carry out users’ spoken commands and respond verbally, and voice assistants like Alexa and Apple’s Siri could become more conversational, potentially rekindling consumer interest in a product category that appears to have stalled, analysts said.
Getting voice assistants like Alexa, Siri and Google Assistant to speak multiple languages has been an expensive and time-consuming process. Tech companies have hired voice actors to record hundreds of hours of speech, which help create the synthetic voices of digital assistants. Advanced AI systems known as “text-to-speech models,” because they convert text into natural-sounding synthetic speech, are just starting to streamline this process.
“The technology is now capable of producing a human voice and synthetic audio in different languages, accents and dialects from text input,” said Marion Labouré, a senior strategist at Deutsche Bank Research.
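Amazon’s production models are proprietary, but the basic idea of text-to-speech can be sketched in a few lines of Python. The example below is a minimal illustration rather than anything resembling Alexa’s pipeline; it uses the open-source pyttsx3 library, which wraps the voices built into the operating system rather than a neural model:

```python
# A minimal text-to-speech sketch using the open-source pyttsx3 library.
# It wraps the operating system's built-in voices; systems like Alexa use
# far larger neural text-to-speech models trained on recorded speech.
import pyttsx3

engine = pyttsx3.init()

# List the voices installed on this machine; availability varies by OS.
for voice in engine.getProperty("voices"):
    print(voice.id, voice.name)

engine.setProperty("rate", 150)  # speaking rate, in words per minute
engine.say("The party last night was great craic.")
engine.runAndWait()              # block until the audio finishes playing
```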
Amazon is under pressure to outdo rivals such as Microsoft and Google in the AI race. In April, Andy Jassy, Amazon’s chief executive, told Wall Street analysts that the company planned to make Alexa “even more proactive and interactive” with the help of sophisticated generative AI, and Rohit Prasad, Amazon’s principal scientist for Alexa, told CNBC in May that he envisioned the voice assistant as a voice-enabled, “instantly available, personal AI.”
Irish Alexa made her commercial debut in November, after nine months of training to understand and then speak an Irish accent.
“Accent is different from language,” Mr. Prasad said in an interview. AI technologies have to learn to disentangle accents from other parts of speech, such as tone and frequency, before they can replicate the characteristics of local dialects; for example, the “a” may be flatter and the “t” pronounced more forcefully.
The systems must detect these patterns, he said, “so that you can synthesize an entirely new accent.” “It’s difficult,” he added.
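One way to picture that disentanglement is as factoring a recording into separate feature streams. The sketch below is a simplified stand-in for what Mr. Prasad describes, not Amazon’s method; it uses the open-source librosa library to pull a pitch contour and timbre-related coefficients out of the same audio file, whose path here is a placeholder:

```python
# A simplified illustration of "disentangling" speech features with the
# open-source librosa library; pitch and timbre come out as separate
# streams that a model could learn to recombine.
import librosa

# Any short WAV of speech will do; the path here is a placeholder.
y, sr = librosa.load("speech_sample.wav", sr=None)

# Fundamental frequency (pitch) contour: one strand of prosody.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Mel-frequency cepstral coefficients: a rough proxy for timbre and
# vowel quality, which carry much of what we hear as an accent.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print("pitch frames:", f0.shape, "timbre frames:", mfcc.shape)
```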
Harder still was getting the technology to learn a new accent largely on its own from different-sounding speech models. That is what Mr. Cotescu’s team attempted in building Irish Alexa. They relied heavily on an existing speech model of predominantly British-English accents, with a much smaller range of American, Canadian and Australian accents, to train it to speak Irish English.
The team faced the distinct linguistic challenges of Irish English. For example, Irish speakers often drop the “h” in “th,” pronouncing the sound as a hard “t” or “d,” so that “bath” can sound like “bat” or even “bad.” Irish English is also rhotic, meaning the “r” is strongly pronounced. That means the “r” in “party” sounds far more distinct than what you might hear from a Londoner’s mouth. Alexa had to learn and master these speech features.
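As a toy illustration of those two features, hypothetical hand-written rules rather than anything a production model uses, one might write th-stopping as a phoneme substitution table while leaving every “r” intact:

```python
# A toy, hypothetical illustration of dialect features as phoneme rules.
# Real text-to-speech models learn such patterns statistically rather
# than from hand-written substitution tables like this one.

# th-stopping: the "th" sounds become hard "t" or "d" sounds.
TH_STOPPING = {"TH": "T", "DH": "D"}  # ARPAbet: TH as in "bath", DH as in "the"

def irishify(phonemes: list[str]) -> list[str]:
    """Apply toy th-stopping; keep every 'R' (Irish English is rhotic)."""
    return [TH_STOPPING.get(p, p) for p in phonemes]

# "bath" in ARPAbet; th-stopping turns it into something like "bat".
print(irishify(["B", "AE1", "TH"]))             # ['B', 'AE1', 'T']
# A non-rhotic British model might drop the R in "party"; here it stays.
print(irishify(["P", "AA1", "R", "T", "IY0"]))  # unchanged, R preserved
```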
Irish English is “tough,” said Mr. Cotescu, who is Romanian and was the lead researcher on the Irish Alexa team.
The speech models powering Alexa’s verbal skills have grown more advanced in recent years. In 2020, Amazon researchers taught Alexa to speak fluent Spanish from an English-language speech model.
Mr. Cotescu and the team saw accents as the next frontier of Alexa’s speech capabilities. They designed Irish Alexa to rely more heavily on AI than on voice actors to build its speech model. As a result, Irish Alexa was trained on a relatively small corpus: about 24 hours of recordings by voice actors who recited 2,000 utterances in Irish-accented English.
In the beginning, when Amazon researchers fed the Irish recordings into the still-learning Irish Alexa, some strange things happened.
Letters and syllables were sometimes dropped from responses, and “s” sounds sometimes stuck together. A word or two, sometimes vital ones, were muttered incoherently and incomprehensibly. In at least one case, Alexa’s feminine voice dropped a few octaves, sounding more masculine. Worse still, the masculine voice sounded distinctly British, the kind of slip that might raise eyebrows in some Irish households.
“They’re big black boxes,” Mr. Tinchev, a Bulgarian citizen who is Amazon’s lead scientist on the project, said of the speech models. “You have to do a lot of experimentation to tune them.”
That is what the technologists did to fix Alexa’s “party” mistake. They traced the speech word by word, phoneme by phoneme (the smallest audible piece of a word), to pinpoint where Alexa was slipping, and fixed it. They then fed more recorded voice data into Irish Alexa’s speech model to correct the mispronunciation.
The result: the “r” in “party” returned. But then the “p” disappeared.
So the data scientists went through the same process again. They eventually homed in on the phoneme containing the missing “p,” then refined the model further until the “p” sound returned and the “r” did not disappear. Alexa was finally learning to speak like a Dubliner.
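That debugging loop, comparing what the model should have said with what it actually said, sound by sound, can be sketched with Python’s standard difflib; the phoneme sequences below are hypothetical examples, not Amazon’s data:

```python
# A sketch of phoneme-level debugging using Python's standard difflib.
# The target and synthesized sequences below are hypothetical examples.
from difflib import SequenceMatcher

target = ["P", "AA1", "R", "T", "IY0"]   # "party" with the Irish r
synthesized = ["P", "AA1", "T", "IY0"]   # the flat, British "pah-tee"

matcher = SequenceMatcher(a=target, b=synthesized)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(f"{op}: expected {target[i1:i2]}, got {synthesized[j1:j2]}")
# Output: delete: expected ['R'], got []
```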
Two Irish linguists, Ellen Vaughan, who teaches at the University of Limerick, and Kate Tallon, a Ph.D. student who works in the Phonetics and Speech Laboratory at Trinity College Dublin, have since given Alexa’s Irish accent high marks. The way Irish Alexa stresses the “r” and softens the “t,” they said, shows Amazon got the accent right.
“It sounds authentic to me,” Ms. Tallon said.
The Amazon researchers said they were gratified by the largely positive response. Their speech model picked up the Irish accent so quickly that they hope to replicate the process for other accents.
“We also plan to expand our methodology to accents of languages other than English,” they wrote in a January research paper about the Irish Alexa project.