This Microsoft project is enabling the digital presence of low-resource languages

A woman named Boa Sr was the last link to a 65,000-year-old pre-Neolithic culture on the Andaman Islands in the Indian Ocean. When she died in 2010, the Bo language died with her.

If that sounds like an isolated incident, it isn’t. Every two weeks, a language is lost somewhere in the world.

Take the Mundas, a community of about a million people spread across the eastern Indian states of Jharkhand, Orissa and West Bengal.

“I learnt Mundari very late in life as my parents lived in another state where they were working, so we didn’t speak the language at home,” says Dr Meenakshi Munda, a member of the Munda community and an assistant professor in the anthropology department at a university in Ranchi, Jharkhand. “I understand how identity matters for a community and our younger generation is losing its identity because they don’t know their language.”

The Munda community is concerned about the longevity of their language as only prominent languages like Bengali, Hindi and Odiya are taught to kids in schools.

While there’s a written script for Mundari, the language has negligible digital content or presence online, which gives people even less incentive to invest in learning it.

A handful of researchers at the Microsoft Research (MSR) lab in India have been working toward creating digital ecosystems for languages, like Mundari, that don’t have enough presence in the digital world.

“The way I define my job for myself is that no person in this world should be excluded from using any technology because they speak a different language,” says Kalika Bali of MSR India.

Bali is an expert in Natural Language Processing, the subfield of linguistics and artificial intelligence (AI) that focuses on training computer systems to understand spoken and written languages.

Her team works with local communities and native speakers to create the base datasets that will be used to build AI technologies for under-represented languages. By involving the community in the data collection process, they hope to create datasets that are both accurate and culturally relevant.

English has been the internet’s dominant language since its earliest years. With improved access to the internet and growing demand for content in native languages, seven other widely spoken languages, including Chinese and Spanish, can now somewhat match English in terms of technological compatibility. But that’s only eight out of nearly 6,000 languages around the world.

This means 88 percent of the world’s languages do not have enough of a presence on the internet. It also means that a whopping 1.2 billion people — 20 percent of the world’s population — can’t use their language to navigate the digital world.

“As a result, the distinction between haves and have-nots became pretty stark,” explains Monojit Choudhury, principal data and applied scientist at Microsoft’s Turing India and Bali’s colleague.

The researchers call languages that do not have the resources required to build technology for a digital presence “low-resource languages.” Under Project ELLORA (Enabling Low Resource Languages), building digital resources has a dual purpose: first, it is a step toward preserving a language for posterity; second, it ensures that users of these languages can participate and interact in the digital world.

Project ELLORA, launched in 2015, began with basics. The first step was to map out what resources were already available, such as printed material like literature and the extent of a digital presence. In a 2020 paper, Bali and her colleagues outlined a six-tier classification, with the top tier representing resource-rich languages like English and Spanish, and the bottom tiers reflecting languages with little-to-no resources.

The work of Project ELLORA is collecting the required resources for these languages and building language models to meet their speakers’ digital needs.

Project ELLORA’s researchers work with the communities to define what this need is and what base technology can help fulfill it. “No language technology can be isolated from the people who are going to use it,” says Bali.

For Mundari, the researchers collaborated with IIT Kharagpur in 2018 and sponsored a study to find out what the community needs to keep the language alive.

What started off as a simple vocabulary game for school children to get them to learn the language soon morphed into sophisticated technology projects.

MSR researchers are currently working on a Hindi-to-Mundari text translation model as well as a speech recognition model, both of which will give the community access to more content in Mundari.

A text-to-speech model, funded under the “Forward – Artificial Intelligence for all” initiative by the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) on behalf of the German Ministry for Economic Cooperation and Development, is also in the works.

But creating language translation models for a language that doesn’t have any significant digital content to train machine learning models is no easy feat.

The team, led by professors of IIT Kharagpur, initially worked with members of the community to have them manually translate sentences from Hindi to Mundari.

To speed up the translation, MSR researchers developed a technology called Interactive Neural Machine Translation (INMT), which predicts the next word as someone translates between languages.

“It (INMT) allows humans to translate from one language to another more effectively. If I’m translating from Hindi to Mundari, when I start typing in Mundari, it gives me predictive suggestions in Mundari itself. It’s like the predictive text you get in smartphone keyboards, except that it does it across two languages,” Bali explains.
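The idea Bali describes, suggestions in the target language that depend on the source sentence, can be illustrated with a very rough sketch. This is not INMT itself: the neural model is replaced here by a hand-written lookup table, and the lexicon, the placeholder words and the `suggest` function are all hypothetical names invented for illustration. The placeholder strings are not real Hindi or Mundari vocabulary.

```python
from typing import Dict, List

# Hypothetical toy lexicon: source word -> candidate target words.
# All strings are placeholders, not real Hindi or Mundari vocabulary.
TOY_LEXICON: Dict[str, List[str]] = {
    "s1": ["alpha", "beta"],
    "s2": ["alto", "alpha"],
}


def suggest(source_sentence: str, typed_prefix: str,
            lexicon: Dict[str, List[str]], limit: int = 3) -> List[str]:
    """Return target-language candidates for the words of the source
    sentence, keeping only those that start with what was typed so far."""
    suggestions: List[str] = []
    for source_word in source_sentence.split():
        for candidate in lexicon.get(source_word, []):
            # Keep candidates that match the typed prefix, without duplicates.
            if candidate.startswith(typed_prefix) and candidate not in suggestions:
                suggestions.append(candidate)
    return suggestions[:limit]


if __name__ == "__main__":
    # The translator has typed "al" while translating the sentence "s1 s2".
    print(suggest("s1 s2", "al", TOY_LEXICON))
```

A real system would rank candidates with a trained translation model rather than a fixed table; the sketch only shows how the source sentence and the typed prefix together narrow the suggestions.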

To build the dataset for text-to-speech, they collaborated with Karya, which started off as a research project by Vivek Seshadri, a principal researcher at MSR. Karya is a digital work platform for capturing, labeling and annotating data for building machine learning and AI models.

The team chose two Mundari speakers, a man and Dr Munda as the female voice, and gave them the translated sentences to record. They recorded the sentences on the Karya app on Android smartphones.

The recordings, along with the corresponding text, are securely uploaded to the cloud and are accessible for researchers to train text-to-speech models.

“The idea is that between Microsoft Research, Karya and IIT Kharagpur, we will have data for machine translation, speech recognition and text-to-speech synthesis, so that all these three technologies can be built for Mundari,” elaborates Bali.

These connections between language and technology are basic building blocks that eventually could enable sophisticated systems like translation services on government websites or streaming platforms. These systems are already a reality for the language you are reading this article in.

The Munda community is not the only one included in Project ELLORA’s work. Other native language development efforts include:

Helping Gondi speakers, very few of whom understand other languages, gain access to information. Project ELLORA worked with partners CGNet Swara and IIIT Naya Raipur to build Adivasi Radio, a hub where news, videos and books can be accessed. The team produced 60,000 parallel sentences between Gondi and Hindi, which have led to the development of a machine translation service.

Working with the Idu Mishmi community in Arunachal Pradesh, in north-eastern India, to create a framework for a digital dictionary for the Idu Mishmi language, which now has fewer than 12,000 speakers. The digital dictionary will be used in schools to teach Idu Mishmi to children.

“We want to shorten the time cycle that it might otherwise take for these languages to have enough data to take advantage of the technology,” Bali says. “If AI can do all these wonderful things for speakers of English, then it should be able to do all these wonderful things for any other human being who does not speak English.”

First Published: Jan 30, 2023 9:30 PM IST

By Pihu Yadav Jan 30, 2023 9:32:35 PM IST (Updated)
