Doubling the database: How research on Digital Language Equality led to 6,000 new resources for the European Language Grid

Over the course of a weekend in the middle of January 2022, the European Language Grid (ELG) doubled in size. More than 6,000 new data resources, tools and services for 87 different languages were added to the ELG platform, pushing the ELG much closer to one of its central objectives: developing into a joint European language technology platform in which ideally all relevant language resources and technologies are registered. With the update, we are now confident that the majority of resources available in Europe can be found in and through the ELG, whether they are corpora, tools, conceptual resources or models. How did that happen? A look at the beginnings of ELG’s sister project, European Language Equality (ELE), might help.

Prospering languages in a digital world

The ELE project’s main goal is to achieve Digital Language Equality in Europe by 2030. According to the preliminary definition, Digital Language Equality describes the state in which all languages have the technological support and situational context necessary for them to continue to exist and to prosper as living languages in the digital age. While this definition paints a clear and desirable picture for the future of multilingualism in Europe, the main work was still lying ahead: developing a strategic research, innovation and implementation agenda and roadmap that leads towards this desired state.

One of the key parts of the strategic agenda is the DLE metric, a quantified index that makes it possible to compare the levels of digital readiness across Europe’s languages. The metric combines several factors for each language, such as the number of its speakers and its recognition in the EU, but most importantly the level of technological support it currently receives. In order to suggest how digital language equality can become a reality for all European languages, detailed knowledge about the current state of technological support for each language is necessary. But how does one gather this amount of data for 87 different languages?

Creating the primary platform for European language technology

The task was part of the ELE investigation into the current LT support for Europe’s languages, in which 33 project partners from different countries described the status quo of their respective languages, based on empirical data and findings. In addition to these national institutions with expertise in language technology, several associations such as the European Language Equality Network (ELEN) and the European Civil Society Platform for Multilingualism (ECSPM) focussed on smaller languages within the European Union. Altogether, the ELE consortium gathered metadata from around 1,000 organisations such as LT companies, universities and research institutions, covering a total of 87 different languages. 4,147 new data resources and 2,216 new tools and services were identified and their metadata documented.

These tools and resources are new in the sense that the approximately 6,000 resources gathered by the ELE consortium had not yet been available in the European Language Grid, which already contained more than 5,000 resources from the European LT landscape. Including the additional 6,000 resources collected by ELE, the ELG platform now provides information about more than 11,000 language technology resources – either as ELG-compatible services that can be downloaded and used directly through the ELG, or in the form of metadata including links to the original hosting platform.

All data leads to Athens

The import itself was handled by the Institute of Language and Speech Processing (ILSP) of the “Athena” Research Center in Greece. The team in Athens, which forms part of both projects, coordinated the metadata collection effort, ensured compatibility with the ELG platform, and homogenised and curated all the metadata records to prepare them for the import. The new import includes both public and on-demand data and services, hosted directly by their providers or through platforms such as Hugging Face or GitHub.

The ELE resource import represents a prime example of the effective collaboration between the two projects and the reason why we consider them sister projects: the development of the strategic research, innovation and implementation agenda and roadmap for full digital language equality requires a comprehensive and empirical overview of the current technological support of Europe’s languages. While the European Language Grid provides exactly this kind of service, the new data in return pushes it much closer towards one of its central objectives: becoming the primary platform for European language technology.

Join the European Language Grid