December 1, 2021 | Audrey-Maude Vézina
Update: December 20, 2021
Research on automatic speech recognition and synthesis has exploded with advances in machine learning.
The speed, intonation, and intensity of speech vary from person to person, which makes recognition and synthesis more complex.
Professor Douglas O’Shaughnessy of the Institut national de la recherche scientifique (INRS) specializes in speech processing, a discipline that is constantly evolving as its algorithms improve, chiefly for automatic speech synthesis and recognition. “Through artificial intelligence, you can simulate a person’s voice, or a person who listens, with human-like performance,” the researcher says. Assistants such as Siri and Alexa rely on this principle of speech recognition.
The difficulty in analyzing speech, compared to images, for example, stems from its variability. “The speed of speech can vary from person to person, partly because of stress or emotion. Speech also varies in intonation and intensity,” the professor explains. “These parameters make recognition and synthesis more complex.”
Noise can also pollute the voice signal, such as when a person speaks near a construction site.
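As a rough illustration of what such noise pollution means for a voice signal, the sketch below mixes noise into a clean signal at a chosen signal-to-noise ratio. It is a minimal example in Python with NumPy; the synthetic tone, the white noise, and the mix_at_snr helper are illustrative assumptions, not tools from the researcher’s work.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a speech signal at a target signal-to-noise ratio (in dB)."""
    noise = noise[: len(speech)]              # match the lengths
    p_speech = np.mean(speech ** 2)           # average power of the speech
    p_noise = np.mean(noise ** 2)             # average power of the noise
    # Scale the noise so that 10 * log10(p_speech / p_scaled_noise) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage: a pure tone stands in for clean speech, white noise for the street
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)     # 1 s at 16 kHz
noisy = mix_at_snr(clean, rng.normal(size=16000), snr_db=5.0)  # 5 dB SNR
```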
AI is based on artificial neural networks loosely modeled on the human brain. Such a network acts like a black box: we can see what goes in and what comes out, but we don’t understand what happens inside.
The computer learns by sifting through vast amounts of data, such as audio recordings. However, there will never be enough data to cover every possible case.
“There are billions of people in the world, and they speak in many different ways. For French and English alone, 100,000 words exist, and they can be combined in countless ways. Languages have an infinite number of possible signals, so many that we cannot train computers for all of them.”
Douglas O’Shaughnessy, expert in speech processing
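As a rough sketch of this data-driven learning, and of the black box it produces, the example below trains a small feed-forward network on synthetic feature vectors standing in for labeled audio. The data, the network size, and the use of scikit-learn are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for labeled audio data: 500 feature vectors (imagine
# 13 spectral measurements per frame), each tagged with one of two "words".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (250, 13)),   # examples of "word" 0
               rng.normal(1.5, 1.0, (250, 13))])  # examples of "word" 1
y = np.array([0] * 250 + [1] * 250)

# A small feed-forward network: we choose what goes in (X) and see what
# comes out (predictions), but the learned weights inside are opaque.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # labels the network assigns to the first examples
```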
To account for this variability, computers divide speech into 10-millisecond segments for analysis, a much finer grain than the roughly 12 sounds per second of typical speech. For tonal languages like Mandarin, where pitch distinguishes word meanings, analyzing the sequence of words isn’t enough. Professor O’Shaughnessy is therefore working on the analysis of pitch. Speaker recognition could also benefit from these advances, since pitch helps reveal a speaker’s gender, age, and emotion.
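For readers who want to see both ideas concretely, here is a minimal Python sketch of 10-millisecond framing and of a simple autocorrelation pitch estimate. The 16 kHz sampling rate, the pure test tone, and the function names are assumptions for illustration; production systems use far more robust pitch trackers.

```python
import numpy as np

SR = 16000               # assumed sampling rate (Hz)
FRAME = SR // 100        # 10 ms -> 160 samples per segment

def frames(signal: np.ndarray) -> np.ndarray:
    """Split a waveform into consecutive 10 ms analysis segments."""
    n = len(signal) // FRAME
    return signal[: n * FRAME].reshape(n, FRAME)

def pitch_autocorr(frame: np.ndarray, fmin: int = 100, fmax: int = 400) -> float:
    """Estimate the pitch (Hz) of one frame from its autocorrelation peak.

    A 10 ms frame can only hold periods up to 10 ms, so pitches below
    100 Hz are out of reach at this frame size.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SR // fmax, SR // fmin          # plausible pitch-period lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return SR / lag

# Toy usage: a 200 Hz tone stands in for a voiced speech frame
tone = np.sin(2 * np.pi * 200 * np.arange(SR) / SR)
print(pitch_autocorr(frames(tone)[0]))       # prints 200.0
```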