Crawling COVID-19: What (and how) web archivists collect

September 10th, 2020

by Karl-Rainer Blumenthal, Web Archivist for Archive-It

Web archivists have already collected over 40 TB of web data from thousands of sites about the Novel Coronavirus (COVID-19) and its effects on their communities, research areas, and institutions. The scale, specificity, and urgency of the topic beg us to ask new questions about some old collecting routines, to align tool development with the most pressing needs, and to imagine altogether new kinds of web data access and usage. Thankfully, Archive-It partners and their peer web archivists are always eager to consider these kinds of issues together. They met virtually on August 26 to ask questions, compare notes, and learn from each other’s experience. The fourth in a series of previously smaller and more foundational calls with web archivists, this conversation revealed some of the overarching themes that define collecting and preserving COVID-19 information on the web thus far:

The best practices apply

Much of the COVID-19 web archiving story is the story of web archiving generally, just concentrated. That is, web archivists find that their community standards and in-house procedures bring necessary structure and process to their COVID-19 collecting specifically, even if it often seems like everything is happening all at once! Take for instance Alex Thurman, the Web Resources Collection Coordinator at Columbia University Libraries, who joined the August call to share his perspective from inside the Content Development Group of the International Internet Preservation Consortium (IIPC), a partnership of 58 research libraries and other web archivists around the world. The IIPC’s Novel Coronavirus (COVID-19) collection was the first curated web archive collection about the pandemic to launch, in spite of its ambitiously global scope of coverage, because Alex’s working group could rely upon some proven procedures for balancing such a load. As with prior collaborative collections, they use a spreadsheet to collect seed URLs and descriptive metadata from participating IIPC members, as well as a public submission form to help widen the scope of the collection beyond IIPC member countries.

Established rights management policies are also now being applied to new collections. Web archivists frequently employ an “opt-out” approach towards creators or site owners, notifying them about the collection and the crawls so that they may decline to be included if they so prefer. The opt-out methodology is popular among research and special collections libraries generally, but some web archivists use the alternative “opt-in” approach, refraining from collecting any web data without their creators’ explicit permission. Melissa Wertheimer, Music Reference Specialist, commented on how this approach served to focus the selection of performing arts content for the Library of Congress’s COVID-19 web archive collection, as it has done among other, earlier established collections.

Web archiving for web formats

The lone arranger with a more specific curatorial mandate isn’t without their own precedents and practices, though. Zakiya Collier, Digital Archivist at the Schomburg Center for Research in Black Culture, the New York Public Library, was already responsible for several web archive collections that represent the African diasporan experience online. In this case, that digital space is less a parallel publishing platform for institutions than a subject in and of itself, with context and character of its own to capture and represent. The Schomburg Center has, for instance, paid particular attention to archiving primary and secondary sources that are uniquely formatted to the web, notably in the case of their #HashtagSyllabusMovement collection. Curating a collection to build upon these existing strengths means focusing the Novel Coronavirus COVID-19 collection on blogs, Substack newsletters, Google Docs, and social media, to start with.

The Schomburg Center’s June capture of a Google sheet for Black-owned food services

The vicissitudes of social media are no new challenge to web archivists, but they raise important questions to answer before embarking on any large collection of COVID-19 materials in particular, i.e., how do we handle the creators’ privileges on these platforms? And what significant properties of a social media platform must be preserved for access? Zakiya, for one, takes a wait-and-see approach, again treating COVID-19 materials like established collections that will matter as much if not more in the long term. Posts that represent the Black Twitter experience are, for instance, collected and held privately until a more universal policy governing their access to researchers might be developed. Others limit their scopes to the institutional social media accounts that they already have a broad mandate to collect, and/or avoid the platforms that present technical problems, like blocking web crawlers, or that remove important context from the public views that crawlers capture.

Web archivists also want to preserve the new ways that people use and experience the web during the global pandemic. Dashboards, mapping interfaces, and data visualization applets are especially ubiquitous and necessary to capture the contemporaneous understanding of COVID-19 in regions, institutions, and cultures. These dynamic web applications present many technical challenges to the web archiving tools that were built to preserve the appearance of web pages at discrete moments in time. Some web archivists use additional acquisition tools outside of the Archive-It software suite to accommodate these special data resources immediately while others wait for the existing tech stack to catch up to the latest developments from the “live” web.

The IIPC’s June capture of COVID-19 data reported from Palestine

Metadata is for sharing

Marge Huang, the Martha Hamilton Morris Archivist at the Philadelphia Museum of Art, had no local web archiving legacy to lean on when she led the effort to create a first collaborative web archive among the members of the Philadelphia Area Consortium of Special Collections Libraries (PACSCL). In order to cover a metropolitan area, PACSCL’s collections distribute curatorial control among a dozen and counting local institutions, enabling each to apply its unique expertise to COVID-19 content selection and processing. Building on the strength of their existing network, PACSCL members instead collaborated closely on the shared guidelines for metadata and description that each could apply in order to enable seamless exploration across any of the institutional boundaries that exist among the collections.

Web archivists have long worked to develop descriptive standards that can apply even more universally. Many can rely, or in this case already have relied, upon the Descriptive Metadata for Web Archiving recommendations published by OCLC Research’s Web Archiving Metadata Working Group. Foundations like this one will help us to establish some new shared practices in the near term as well. In the case of COVID-19, Archive-It partners still need to collaborate on a modicum of standard descriptive values in order to unify any amount of general public access to their disparate collections even more broadly, so PACSCL’s precedent is very instructive in the meantime.

Metadata record from the Pennsylvania Horticultural Society COVID-19 Collection

What will that access look like, though? All digital preservationists and their patrons have skin in this game, so we look forward to including them in more conversations about the future, starting at the 2020 Archive-It Virtual Partner Meeting. Please join us if you can! And look forward with us towards more opportunities to discuss these special collections and what they teach us.