Synoptic web archiving: Preserving a multilingual environment

July 8th, 2021

by Silvia Sevilla, Publications Office of the European Union

The Publications Office of the European Union was established in 1969 to publish European legislation and jurisprudence. The Publications Office (known as OP, from the French “Office des Publications”) also assumed the role of publisher for any other type of information created by the European Union. Today, the OP is much more than a traditional publishing house. The many services that we provide for the European institutions include production, editing, identification, printing, distribution, access, use and reuse, archiving and preservation.

The EU has three alphabets (Cyrillic, Greek and Latin) and 24 official languages. Adoption of a single EU language has sometimes been considered, but democracy, transparency and accountability require that all EU citizens understand clearly what is being done in their name and have access to EU legislation, procedures and information in their own languages. Consequently, no legislation can enter into force until it has been translated into all official languages and published in the Official Journal of the EU.

Timeline with Languages of the European Union

European Union Official Languages

The translations into the 24 language versions are handled synoptically. This means that each language version has an identical structure and the versions can be consulted in parallel. For example, the same text can be found on the same page in all official languages:

“Farm to Fork” website in English and Spanish

The 24 official languages of the EU are its public face. Internally, the institutions operate in three working languages: English, French and German.

EU Web Archive

Work on preserving EU websites began in 2013. In 2018, the Publications Office took over this task, in accordance with its mandate to preserve all publications of the European institutions.

After working with different contractors who ran the EU web archive on our behalf, we decided to look for a tool that we could operate ourselves, to gain more flexibility in seed selection, scoping and crawling. We then became an Archive-It partner. All the content we had captured since 2013, which was stored on external disks, was compiled and ingested into the archive.

Thanks to the Waybackfill Service provided by the Internet Archive, we integrated the historic archives of the European Union that already existed in the Wayback Machine. We found web pages stored in the Internet Archive since 1996, and all this material was incorporated in 2020.

“Official Journal of the EU” capture from 2005, ingested using Waybackfill

We archive the websites hosted on the domain. About 250 top websites are captured at least four times a year. Despite the small number of seeds, the volume of data is considerable because many of these URLs are published in 24 or more languages.

The main challenge when capturing web pages in different language versions is the complexity of the language selectors. The language menus in the top-right corner of the web pages often use POST requests to reload pages in different languages, and this is not easy for a crawler to support.

Sometimes, even if the crawler is able to capture the multilingual content, it cannot be reached using the language selector or by navigating the page. In those cases the content is archived, but the only way to reach it is to change the language code in the URL.

There are several solutions to this, and some of them involve adjusting the live websites. In the more difficult cases we can also multiply the seeds, creating one for each language version. This is not ideal, but at least it allows us to display the content in all available languages.
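As a rough illustration of multiplying the seeds, the sketch below derives one seed URL per official language from a single seed. It assumes that the language appears as a two-letter code in its own path segment, which will not hold for every site. The function name `language_seeds` and the `example.org` URL are hypothetical and are not part of any crawler or Archive-It configuration.

```python
from urllib.parse import urlsplit, urlunsplit

# Two-letter ISO 639-1 codes of the 24 official EU languages.
EU_LANGUAGES = (
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
)


def language_seeds(seed_url, languages=EU_LANGUAGES):
    """Return one seed URL per language code (hypothetical helper).

    Assumption: the first path segment that equals a known language code
    is the language marker, e.g. ".../info/en/index.html".
    """
    parts = urlsplit(seed_url)
    segments = parts.path.split("/")
    for index, segment in enumerate(segments):
        if segment in languages:
            break
    else:
        raise ValueError(f"no language code found in path: {parts.path!r}")

    seeds = []
    for code in languages:
        new_segments = segments.copy()
        new_segments[index] = code  # swap only the language segment
        seeds.append(urlunsplit(parts._replace(path="/".join(new_segments))))
    return seeds


seeds = language_seeds("https://example.org/info/en/index.html")
```

A short path segment can match a language code by coincidence, so any list generated this way would need checking against the live site before being used as seeds.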

The following chart shows the number of language versions in which our websites are published:

Pie chart showing percentages of each group of languages in the EU web archive

EU Web Archive Languages

  • 41% of our websites are only available in English. These are sites with frequently updated content or with a short lifespan.
  • 13% of sites are in several languages, usually the working languages (English, French and German) plus a few more versions. These are normally Italian, Polish or Spanish, the next most widely spoken languages in Europe.
  • 39% of the archive is available in 23 or 24 language versions. These are the websites dedicated to general information, those that include official documents and European Union policies. (Sometimes we find 23 rather than 24 language versions because primary legislation may be the only content that is published in Irish as an official language.)
  • Finally, 7% of sites are available in 25 or more versions because their content or target users go beyond the borders of the European Union. This happens, for example, with topics like migration, culture or the environment. Examples of these additional languages are Albanian, Arabic, Bosnian, Chinese, Icelandic, Macedonian, Norwegian, Serbian and Turkish.
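The grouping used in this breakdown can be sketched in a few lines. The example below assigns each site to a group by its number of language versions and computes the share of each group. The boundary of 2 to 22 versions for "several languages" is an assumption, the function names are hypothetical, and the sample counts are invented for illustration; they are not the archive's data.

```python
from collections import Counter


def language_group(version_count):
    """Map a number of language versions to one of four groups.

    Assumption: "several languages" means 2 to 22 versions.
    """
    if version_count < 1:
        raise ValueError("a site needs at least one language version")
    if version_count == 1:
        return "single language"
    if version_count <= 22:
        return "several languages"
    if version_count <= 24:
        return "23 or 24 languages"
    return "25 or more languages"


def group_shares(version_counts):
    """Return the percentage of sites per group, rounded to whole numbers."""
    counts = Counter(language_group(n) for n in version_counts)
    total = sum(counts.values())
    return {group: round(100 * count / total) for group, count in counts.items()}


# Invented sample: four sites with 1, 1, 3 and 24 language versions.
shares = group_shares([1, 1, 3, 24])
```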

We are pleased to be able to give public access to the archives of the domain in users' native languages. Looking ahead, we are working on the design of a new home page, using the APIs offered by the Internet Archive to create our own custom front end to these multilingual resources. We also aim to integrate the WARC files into our long-term repository, where we preserve all publications of the European Union in multiple formats.