Is your language a low-resource language?

Tiya Vaj
3 min read · Nov 2, 2022

I have come across papers that describe their target language as low-resource in the title, but I am still figuring out whether my own language counts as one. Several factors could determine whether a particular language is classed as low-resource.

Recently, I had a look at the “TwHIN-BERT” paper, which presents a multilingual language model pre-trained on Twitter data. The training data covers high-, mid-, and low-resource languages, with the resource level determined by how frequently each language appears on Twitter. According to the paper, English, Japanese, and Arabic are high-resource languages; Greek, Urdu, and Dutch are mid-resource languages; and Norwegian, Danish, and Pashto are low-resource languages. However, I believe that different papers may use different criteria to decide which languages are low-resource.
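To make the idea of frequency-based tiers concrete, here is a minimal sketch of how languages could be bucketed by their share of a corpus. The function name `bucket_languages`, the two thresholds, and the toy data are my own assumptions for illustration; they are not taken from the TwHIN-BERT paper, which may define its tiers differently.

```python
from collections import Counter


def bucket_languages(lang_tags, high_share=0.05, mid_share=0.005):
    """Group language codes into high/mid/low tiers by corpus share.

    lang_tags: iterable of language codes, one per document (e.g. per tweet).
    high_share, mid_share: hypothetical cut-offs, chosen only for illustration.
    """
    counts = Counter(lang_tags)
    total = sum(counts.values())
    tiers = {"high": [], "mid": [], "low": []}
    # most_common() walks languages from most to least frequent
    for lang, n in counts.most_common():
        share = n / total
        if share >= high_share:
            tiers["high"].append(lang)
        elif share >= mid_share:
            tiers["mid"].append(lang)
        else:
            tiers["low"].append(lang)
    return tiers


# Toy data: 900 English, 95 Greek, and 5 Pashto documents (made-up counts)
sample = ["en"] * 900 + ["el"] * 95 + ["ps"] * 5
print(bucket_languages(sample, high_share=0.5, mid_share=0.05))
```

Changing the two thresholds moves languages between tiers, which is one reason different papers can label the same language differently.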

Furthermore, I am still searching for how to define a low-resource language, and one answer on Stack Exchange inspired me. It explained that when abundant data resources are available for a language, machine-learning-based systems can be developed for that language. These languages are known as “high-resource languages.” The language with the most resources by far is English. The languages of Western Europe, as well as Japanese and Chinese, are also rather thoroughly covered. The converse is true for low-resource languages, which are those that have few or no resources. This applies to many regional dialects as well as several extinct or almost extinct languages. There are many mostly oral languages for which few written (electronic) resources exist; for others, there are written texts but not even the most fundamental of resources, a dictionary.

As for what we can do with low-resource languages, several tasks can be performed on them, for example machine translation, dataset creation, multilingual embeddings, classification tasks, and sentiment analysis.

It is a good sign that several studies have been conducted on low-resource languages, and I believe that in the future more and more language resources will be published and made available online, thanks to the popularity of the Natural Language Processing field. However, some challenges arise when dealing with low-resource languages in terms of data access, evaluation methods, and word representation procedures.

To sum up, the boundary that places a language in the low-resource category is still unclear. If anyone finds interesting material on the criteria for low-resource languages, feel free to leave it in the comments, so that it can benefit me and all readers.

References:

https://www.researchgate.net/publication/363651659_TwHIN-BERT_A_Socially-Enriched_Pre-trained_Language_Model_for_Multilingual_Tweet_Representations/link/6328012f70cc936cd31b8345/download


Tiya Vaj

Ph.D. Research Scholar in Informatics, passionate about data-driven work for social good. Let's connect here: https://www.linkedin.com/in/tiya-v-076648128/