Project WARC-Speed: Challenges and opportunities for web archiving programs

September 14th, 2021

By Grace McGann (Moran), Teen Librarian, Tipp City Public Library

As a Graduate Web Archiving Assistant during the 2020-21 academic year, I was tasked with evaluating the University of Illinois Archive-It partnership and creating a plan for its continued success. My final recommendations to the Associate Dean for the Office of Digital Strategies were shaped by the myriad challenges that face a burgeoning web archiving program. The opportunity was well-timed to inform my strategy as I continue web archiving through the Internet Archive’s Community Webs program at the Tipp City Public Library. Below, I share some of what I learned and what I am carrying forward with me.

Through exploration of the external literature, internal policies, and my own hands-on experiences, I learned about four main challenges to a growing web archiving program: preservation intent, metadata standards, end user access, and consistent program stewardship.

Diagram of challenges to growing web archiving programs

Trevor Owens describes the concept of preservation intent in his book, The Theory and Craft of Digital Preservation. In the web archiving context, preservation intent means knowing the significant properties of web resources that are important and necessary to preserve in order to meet the end user’s needs. To narrow down the possibilities, you can, for instance, ask yourself:

  • What information from this website am I looking to preserve?
  • Do my web crawls need to collect the other websites that this website links to as well? If so, which ones are relevant?
  • Do I want full functionality of the page, or just enough to make the information available?
  • What do I want the end user’s experience with this captured website to look like?

The answers to these questions dictate how to configure a web crawl and what an end user sees when they access the archives.

When I began investigating University of Illinois collections, I found a paucity of access points. One catalog record pointed patrons to a single University of Illinois at Urbana-Champaign Web Archives collection; one library’s website linked out to its own collections. Otherwise, users would have to navigate to the public-facing Archive-It website and either search for “University of Illinois” or stumble across a University of Illinois collection as a result of another search. I learned quickly that the University was not alone in this. Many institutions struggle with collection accessibility and searchability. These are some of the most time-consuming yet necessary aspects of web archiving.

Important: Metadata and end-user access are two deeply related issues

Metadata and end-user access are deeply integrated. How we describe something today affects its accessibility in the future. The OCLC Research Library Partnership’s Web Archiving Metadata Working Group conducted extensive research into current practices and needs for description in web archiving. They recommended implementation of the following 14 fields, which I proposed the University of Illinois adopt for seed-level description:

  • Collector
  • Contributor
  • Creator
  • Date
  • Description
  • Extent
  • Genre/Form
  • Language
  • Relation
  • Rights
  • Source of Description
  • Subject
  • Title
  • URL
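
For readers who track seed descriptions in a spreadsheet or script, the 14 fields above can be sketched as a simple record. This is a minimal illustration in Python, not the working group's schema and not an Archive-It API: the class name `SeedRecord`, the helper `missing_fields`, the choice of which fields hold lists, and the sample values are all my assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class SeedRecord:
    """Hypothetical seed-level description built from the 14 recommended fields."""
    title: str = ""
    url: str = ""
    collector: str = ""
    # Treating these as repeatable (lists) is an assumption for illustration.
    contributor: List[str] = field(default_factory=list)
    creator: List[str] = field(default_factory=list)
    date: str = ""
    description: str = ""
    extent: str = ""
    genre_form: str = ""
    language: str = ""
    relation: str = ""
    rights: str = ""
    source_of_description: str = ""
    subject: List[str] = field(default_factory=list)


def missing_fields(record: SeedRecord) -> List[str]:
    """Return the names of fields that are still empty on a record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example with placeholder values: a partially described seed.
record = SeedRecord(
    title="Example Library Homepage",
    url="https://example.org/",
    collector="Example Public Library",
)
print(missing_fields(record))
```

A check like `missing_fields` is one way to spot incomplete descriptions before they are published, which supports the kind of consistency a written description policy aims for.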

I now plan to adopt these fields into my policy for description at the Tipp City Public Library. Since I am building from the ground up, I am able to ensure consistency across the board and document everything I do in detail in the form of policy documents.

The long-term viability of web archive collections depends upon consistent stewardship, both in terms of people and policy. Archive-It’s Web Archiving Life Cycle Model suggests that policy should be the guiding arm of every web archiving program. During my time working with the University of Illinois web archives program, I learned that general library policy for collection development, staffing, etc. is not sufficient for the governance of a web archiving program. The materials and the challenges are too distinctive for a blanket policy to cover. My recommendations therefore included implementing three policies specific to the web archiving program:

  • A collection development policy
  • A standardized workflow detailing the process of web crawls, troubleshooting, and description
  • A statement on the issues of copyright and ethics in web archiving

The care and keeping of web archive collections ensures a useful and substantive electronic historical record. As our lives are shared over a constantly evolving web, we need to preserve important information about our society and its history for the future. With clarity on our intentions, commitment to user access, and ongoing management, we can safeguard this opportunity to tell today’s history tomorrow.