Unlocking the research potential of web archives: An ARCH cohorts update

February 10th, 2022

by the Archive-It team

Graph visualizing crawl frequency by language, generated by the AWAC2 cohort using pandas and Altair libraries for Python

Visualization of crawl frequency by language using pandas and Altair libraries for Python, full sample (F. Clavert)


Launched in July 2021, the Archives Unleashed Cohort Program supports and facilitates research engagement with web archives.

Bringing together multi-institutional and interdisciplinary research teams, cohort members engage in a year-long collaboration while receiving resources and mentorship from the Archives Unleashed team (including academics and Internet Archive staff) to conduct focused research using web archives as scholarly objects. 

The program’s first research teams from across North America and Europe selected a wide range of topics to study, including crisis communication, health misinformation, pandemic discourse, comparative feminist media activism, and the development of online commenting systems. They are also the inaugural group of users to pilot ARCH (Archival Research Compute Hub). ARCH is a platform for generating datasets and a major milestone of the integrative work between Archives Unleashed and Archive-It collaborators.

Over the past seven months, five cohort teams have used ARCH to generate derivative datasets from Archive-It collections and analyze them further in pursuit of interdisciplinary research topics. Many teams have applied methods such as sentiment analysis, topic modeling, and thematic coding to the plain-text dataset to uncover patterns, changes, and repetition within the corpus of a web archive collection.

But we know that web archives provide more than just the text of a website, and teams have extracted additional metadata to explore HTML data and network connections, and have even branched into image analysis. Teams have also expressed interest in understanding the temporal elements of collections – for instance, how discourse or web elements change over time.

In a recent interview, one cohort member described ARCH as a ‘gateway’ – an entry point when working with volumes of data that are too big for Excel. The ARCH platform enabled teams to gain a quick understanding of the contents of various web archives, while inspiring additional analysis based on those contents. 

Teams developed workflows using a variety of tools for additional analysis of ARCH derivative datasets, including Gephi (network graphing), IRaMuTeQ (multidimensional text analysis), and Jupyter Notebooks (a critical tool for large-scale computational analysis). In addition to testing ARCH and providing feedback, members of this first cohort have supported each other and shared research best practices and creative solutions.

Introducing the 2021 Archives Unleashed Cohorts

AWAC2: Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset

Project members: Valérie Schafer (University of Luxembourg), Frédéric Clavert (University of Luxembourg), Karin De Wild (Leiden University), Niels Brügger (Aarhus University), Susan Aasman (University of Groningen), Sophie Gebeil (University of Aix-Marseille)

Crisis Communication in the Niagara Region during the COVID-19 Pandemic

Project members: Tim Ribaric, David Sharron, Cal Murgu, Karen Louise Smith, Duncan Koerber (Brock University)

Project Website: https://brockdsl.github.io/archives_unleashed/ 

Mapping and tracking the development of online commenting systems on news websites between 1996–2021

Project members: Anne Helmond (University of Amsterdam/University of Siegen), Johannes Paßmann, Robert Jansma (University of Siegen), Luca Hammer (University of Siegen), Lisa Gerzen (Ruhr University Bochum). Contributors: Dave Wahl (University of Amsterdam), Steffen Reinhard (Ruhr University Bochum), and Theresa Schulte (University of Siegen)

Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves

Project members: Shana MacDonald (University of Waterloo), Aynur Kadir (University of Waterloo), Brianna Wiens (York University), Sid Heeg (University of Waterloo)

Viral health misinformation from Geocities to COVID-19

Project members: Shawn Walker, Michael Simeone, Kristy Roschke, Anna Muldoon, Major Brown (Arizona State University)

To learn more about the cohorts’ individual projects and their progress over their first seven months, please check out the Archives Unleashed blog post: Research Applications with Web Archives: Collaboration Among Archives Unleashed Cohorts.

You can also hear more from these teams firsthand as part of the Internet Archive’s Library as a Laboratory series, beginning in March. Visit the Internet Archive blog to learn more and register to attend.

On behalf of the cohorts and Archives Unleashed Project, special thanks to Archive-It partners and dedicated staff whose curatorial work has provided an opportunity for research exploration and discovery: 

  • International Internet Preservation Consortium
  • Mark Graham
  • Nick Ruest
  • Brock University
  • Duke University
  • National Museum of Women in the Arts
  • New York University
  • San Jose State University, School of Information
  • Schlesinger Library
  • Temple University Special Collections

If you’d like to learn more about participating in the 2022 Archives Unleashed Cohort, bookmark the Archives Unleashed event page. A call for proposals will open in mid-February 2022.