Thanks to advances in speech and natural language processing, there is hope that one day you will be able to ask your virtual assistant what the best ingredients are for a salad. It is currently possible to ask your home gadget to play music, or to turn it on with a voice command, a feature already found on many devices.
If you speak Moroccan, Algerian, Egyptian, Sudanese, or any other dialect of Arabic, a language that varies immensely from region to region and whose varieties are sometimes mutually unintelligible, it is a different story. If your mother tongue is Arabic, Finnish, Mongolian, Navajo, or any other language with highly complex morphology, you may feel left out.
These complex constructions spurred Ahmed Ali to find a solution. He is the chief engineer in the Arabic Language Technologies group at the Qatar Computing Research Institute (QCRI), part of Qatar Foundation's Hamad Bin Khalifa University, and the founder of ArabicSpeech, "a community that exists for the benefit of Arabic speech science and speech technologies."
But he became captivated by the idea of talking to cars, devices, and gadgets many years ago, while he was at IBM. "Can we build a machine that can understand different dialects – an Egyptian pediatrician who wants to automate a prescription, a Syrian teacher who wants to help children get the key points from their lesson, or a Moroccan chef describing the best recipe for couscous?" he says. However, the algorithms that power these machines cannot yet untangle the roughly 30 varieties of Arabic, let alone make sense of them. Today, most speech recognition tools work only in English and a handful of other languages.
The coronavirus pandemic has further amplified reliance on voice technology, with natural language processing technologies helping people follow stay-at-home guidelines and physical distancing measures. However, while we have used voice commands to aid e-commerce shopping and manage our households, the future holds even more applications.
Millions of people around the world use massive open online courses (MOOCs) for their open access and unlimited participation. Speech recognition is one of the main features of MOOCs, letting students search within the spoken content of courses and enabling translation via subtitles. Speech technology also enables the digitization of lectures, displaying spoken words as text in university classrooms.
According to a recent article in Speech Technology magazine, the voice and speech recognition market is projected to reach $26.8 billion by 2025, as millions of consumers and companies around the world come to rely on voice bots not only to interact with their devices or cars, but also to improve customer service, drive innovation in healthcare, and improve accessibility and inclusion for those with hearing, speech, or motor impairments.
In a 2019 survey, Capgemini predicted that by 2022 more than two out of three consumers would opt for voice assistants rather than visits to banks or branch offices; a share that could justifiably grow, given the housebound, physically distanced life and commerce that the pandemic has imposed on the world for more than a year and a half.
Nonetheless, these devices fail to serve large parts of the world. For those 30 varieties of Arabic and their millions of speakers, that is a substantially missed opportunity.
Arabic for machines
Voice bots that speak English or French are far from perfect. Yet teaching machines to understand Arabic is especially difficult, for several reasons. Three challenges are generally recognized:
- Lack of diacritics. Arabic dialects are vernaculars, primarily spoken rather than written. Most of the available text is non-diacritized, meaning it lacks the accent marks, such as the acute (´) or grave (`), that indicate the sound values of letters. Therefore, it is difficult to determine where the vowels go.
- Lack of resources. There is a shortage of labeled data for the different Arabic dialects. Collectively, they also lack standardized orthographic rules that dictate how to write the language, including norms for spelling, hyphenation, word breaks, and emphasis. These resources are crucial for training computer models, and the fact that too few of them exist has hampered the development of Arabic speech recognition.
- Morphological complexity. Arabic speakers engage in a lot of code switching. For example, in the parts of North Africa colonized by the French (Morocco, Algeria, and Tunisia), the dialects include many borrowed French words. Consequently, there is a large number of so-called out-of-vocabulary words, which speech recognition technologies cannot understand because the words are not Arabic.
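To make the diacritics problem above concrete: a single undiacritized word form can correspond to several different words and pronunciations. The toy lookup below is only an illustrative sketch (the lookup table is hand-picked, not data from any real system), using the classic example of the consonantal skeleton كتب, which can be read as kataba ("he wrote"), kutub ("books"), or kutiba ("it was written"):

```python
# Toy illustration: one undiacritized Arabic form, several valid readings.
# The table is a single hand-picked example, not real system data.
readings = {
    "كتب": [  # undiacritized consonantal skeleton k-t-b
        ("كَتَبَ", "kataba", "he wrote"),
        ("كُتُب", "kutub", "books"),
        ("كُتِبَ", "kutiba", "it was written"),
    ],
}

def possible_readings(word):
    """Return all known diacritized readings of an undiacritized form."""
    return readings.get(word, [])

for form, latin, gloss in possible_readings("كتب"):
    print(f"{form} ({latin}) -> {gloss}")
```

A recognizer or synthesizer sees only the undiacritized form and must pick the right reading from context, which is exactly what makes vowel-less text hard for machines.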
“But the field is moving at lightning speed,” says Ali, and many researchers are collaborating to make it move even faster. Ali’s Arabic Language Technologies lab is leading the ArabicSpeech project to bring together Arabic translations with the dialects native to each region. For example, Arabic dialects can be divided into four regional groups: North African, Egyptian, Gulf, and Levantine. However, since dialects do not respect borders, the divisions can be as fine-grained as one dialect per city; a native Egyptian speaker, for example, can distinguish the Alexandrian dialect of one compatriot from that of a fellow citizen from Aswan (a distance of 1,000 kilometers on the map).
Building a technologically smart future for all
At the moment, machines are about as accurate as human transcribers, thanks in large part to advances in deep neural networks, a subfield of machine learning in artificial intelligence that relies on algorithms inspired by how the human brain works, biologically and functionally. Until recently, however, speech recognition was somewhat hacked together: the technology historically relied on different modules for acoustic modeling, pronunciation lexicon construction, and language modeling, each of which needed to be trained separately. More recently, researchers have trained models that convert acoustic features directly into text transcriptions, potentially optimizing all parts for the end task.
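The end-to-end approach described above can be sketched at its decoding step: a neural acoustic model emits per-frame symbol probabilities, and a greedy CTC-style decoder collapses repeated symbols and drops "blank" frames to produce text directly, with no separate pronunciation lexicon or language model. The tiny alphabet and frame scores below are invented purely for illustration:

```python
# Greedy CTC decoding sketch: take the best symbol per frame,
# collapse consecutive repeats, then remove blank symbols.
BLANK = "_"
alphabet = [BLANK, "a", "b", "c"]  # toy symbol set for illustration

def greedy_ctc_decode(frames):
    """frames: list of per-frame score lists aligned with `alphabet`."""
    best = [alphabet[max(range(len(f)), key=f.__getitem__)] for f in frames]
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

frames = [
    [0.1, 0.8, 0.05, 0.05],   # best symbol: 'a'
    [0.1, 0.8, 0.05, 0.05],   # 'a' again -> collapsed with previous
    [0.9, 0.03, 0.03, 0.04],  # blank separates repeated letters
    [0.1, 0.8, 0.05, 0.05],   # a new 'a' after the blank
    [0.1, 0.1, 0.7, 0.1],     # 'b'
]
print(greedy_ctc_decode(frames))  # -> aab
```

In a real end-to-end system the frame scores come from a trained network over thousands of audio frames and a much larger symbol set, but the decoding idea is the same.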
Even with this progress, Ali still cannot issue a voice command to most devices in his native Arabic. “It’s 2021, and I still can’t speak to many machines in my dialect,” he comments. “I mean, I now have a device that can understand my English, but machine recognition of multi-dialect Arabic hasn’t happened yet.”
Achieving this is the focus of Ali’s work, which culminated in the first transformer for recognizing Arabic speech and its dialects, one that has achieved hitherto unmatched performance. Called the QCRI Advanced Transcription System, the technology is currently used by the broadcasters Al Jazeera, DW, and the BBC to transcribe online content.
There are several reasons why Ali and his team have succeeded in building these speech engines right now. First of all, he says, “there is a need to have resources for all of the dialects. We need to build the resources so that we can then train the model.” Advances in computer processing mean that computationally intensive machine learning now happens on graphics processing units, which can rapidly process and display complex graphics. As Ali says, “We have a great architecture, good modules, and we have data that represents reality.”
Researchers from QCRI and Canary AI recently built models that can achieve human parity in Arabic broadcast news. The system demonstrates the impact of subtitling Al Jazeera’s daily reports. While the human error rate (HER) for English is about 5.6%, the research found that Arabic HER is significantly higher, reaching up to 10% because of the morphological complexity of the language and the lack of standard orthographic rules in dialectal Arabic. Thanks to recent advances in deep learning and end-to-end architectures, the Arabic speech recognition engine manages to outperform native speakers on broadcast news.
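The error rates quoted above follow the standard word error rate metric: the minimum number of word substitutions, insertions, and deletions needed to turn the system's hypothesis into the reference transcript, divided by the reference length. A minimal sketch of that computation, using made-up toy sentences rather than any real transcripts:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference -> WER of 0.2 (20%).
print(word_error_rate("the news aired at noon", "the news aired at night"))
```

An English HER of 5.6% thus means humans get roughly one word in eighteen wrong; the morphological richness of dialectal Arabic pushes that closer to one in ten.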
While Modern Standard Arabic speech recognition seems to work well, researchers from QCRI and Canary AI are busy testing the limits of dialect processing and achieving strong results. Since no one speaks Modern Standard Arabic at home, attention to dialect is what it will take for our voice assistants to understand us.
This content was written by the Qatar Computing Research Institute, Hamad Bin Khalifa University, a member of Qatar Foundation. It was not written by MIT Technology Review’s editorial staff.