
The Language of Artificial Intelligence

December 1, 2021 | Audrey-Maude Vézina

Updated: December 20, 2021

Research on automatic speech recognition and synthesis has exploded with the optimization of machine learning.

[Image: the voice signal. The speed, intonation, and intensity of speech can vary from person to person, which makes recognition and synthesis more complex.]

Professor Douglas O’Shaughnessy of the Institut national de la recherche scientifique (INRS) specializes in speech processing. The discipline is constantly evolving as its algorithms, mainly for automatic speech synthesis and recognition, are optimized. “Through artificial intelligence, you can simulate a person’s voice, or a listener who understands speech, with human-like performance,” the researcher says. Assistants such as Siri and Alexa are built on speech recognition.

The difficulty of analyzing speech, compared to images, for example, stems from its variability. “The speed of speech can vary from person to person, for instance because of stress or emotion. Speech also varies in intonation and intensity,” explains the professor. “These parameters make recognition and synthesis more complex.”

Noise can also pollute the voice signal, such as when a person speaks near a construction site.
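As a rough illustration (not drawn from the article), the sketch below uses NumPy to mix synthetic noise into a clean signal at a chosen signal-to-noise ratio, the kind of degradation a recognizer near a construction site has to cope with. All names and values here are hypothetical.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech so the mixture has the requested SNR in decibels."""
    noise = noise[: len(speech)]                   # match lengths
    p_speech = np.mean(speech ** 2)                # average signal power
    p_noise = np.mean(noise ** 2)                  # average noise power
    # Choose a gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

sr = 16000                                         # 16 kHz sampling rate
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 150 * t)               # stand-in for a voiced sound
noise = np.random.default_rng(0).normal(size=sr)   # stand-in for site noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)      # speech only 5 dB above noise
```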


Learning speech

This kind of AI is based on artificial neural networks loosely inspired by the human brain. It acts like a black box: we can see what goes in and what comes out, but we don’t understand what’s going on inside.

The computer learns by sifting through vast amounts of data, such as audio recordings. However, there will never be enough data to cover every possible case.

“There are billions of people in the world, and they speak in many different ways. For French and English alone, some 100,000 words exist, and they can be combined in countless ways. Languages produce a practically infinite number of signals, so many that we cannot train computers for every possibility.”

Douglas O’Shaughnessy, expert in speech processing
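To make the “black box” idea concrete, here is a minimal PyTorch sketch of the kind of network such systems train: acoustic feature vectors go in, sound-class predictions come out, and the learned weights in between remain opaque. This is an illustrative toy with made-up dimensions and random stand-in data, not the professor’s system.

```python
import torch
import torch.nn as nn

N_FEATURES = 13   # e.g., 13 spectral coefficients per frame (hypothetical)
N_CLASSES = 40    # e.g., roughly one class per English phoneme (hypothetical)

# The "black box": inputs and outputs are visible, but the millions of
# learned weights inside are hard to interpret.
model = nn.Sequential(
    nn.Linear(N_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, N_CLASSES),
)

# Random stand-in data playing the role of labeled audio frames.
x = torch.randn(1000, N_FEATURES)           # 1000 acoustic feature vectors
y = torch.randint(0, N_CLASSES, (1000,))    # the sound class of each frame

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                     # sift through the data repeatedly
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)             # how wrong are the predictions?
    loss.backward()                         # compute gradients of the loss
    optimizer.step()                        # nudge the weights to do better
```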

To cope with this variability, computers divide speech into 10-millisecond segments for analysis, a resolution much finer than the roughly 12 sounds per second of typical speech. For languages like Mandarin that rely on tone to convey meaning, analyzing the sequence of words isn’t enough, so Professor O’Shaughnessy is working on the analysis of pitch. Speaker recognition could also benefit from these advances, since pitch makes it possible to infer a speaker’s gender, age, and emotion.
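The two analysis steps mentioned above can be sketched in a few lines of NumPy: cutting a waveform into 10-millisecond frames, then estimating each frame’s pitch with a simple autocorrelation method. This is one classic textbook approach, assumed here purely for illustration; it is not necessarily the method used in Professor O’Shaughnessy’s research, and real systems typically use longer windows for pitch.

```python
import numpy as np

def frame_signal(signal: np.ndarray, sr: int, frame_ms: float = 10.0) -> np.ndarray:
    """Split a waveform into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sr * frame_ms / 1000)              # samples per frame
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def pitch_autocorr(frame: np.ndarray, sr: int,
                   f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Estimate the fundamental frequency (pitch) of one frame, in Hz."""
    frame = frame - frame.mean()                       # remove DC offset
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    corr = corr / np.arange(len(frame), 0, -1)         # normalize for overlap
    lo = int(sr / f_max)                               # shortest plausible period
    hi = min(int(sr / f_min), len(corr) - 1)           # longest plausible period
    lag = lo + int(np.argmax(corr[lo:hi]))             # strongest periodicity
    return sr / lag

sr = 16000                                             # 16 kHz sampling rate
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)                     # synthetic 200 Hz "voice"
frames = frame_signal(tone, sr)                        # 100 frames per second
print(pitch_autocorr(frames[0], sr))                   # prints 200.0
```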