A Sustainable, Large-Scale, Minimal Approach to Accessing Web Archives

October 18th, 2016

The following is a guest post by Greg Wiedeman, University Archivist at the University at Albany, SUNY

This is a version of the talk I gave at the 2016 Archive-It Partner Meeting before the SAA Annual Meeting in August. The original slides are available here: http://gwiedeman.github.io/presentations/slides/Access2WebArch.html

In recent years many archives have expanded to preserve the web among other formats in their traditional collecting areas. Yet, unlike traditional formats, the best way to make web archives available to researchers is not with boxes and call numbers, but in their original environment — an Internet-connected web browser. This post discusses how at the University at Albany, SUNY, we are providing minimal access to this new type of archival material together with other formats that originated with the same creators.

First, some background about what we’re doing with web archives. The University at Albany, SUNY has been an Archive-It partner since 2012. The primary motivation to start collecting web sites was to meet New York State public records laws that require us to permanently keep many records that are published directly to the web. We aimed to capture and preserve the entirety of www.albany.edu and its subdomains, leaving access and use for later. The program became my responsibility in early 2015, but remains secondary to my other roles. Early in 2016, we expanded our collecting program in line with our outside collecting areas – mostly New York State politics, public policy, labor, and capital punishment.

Still, contrary to our mission, we were building collections that were not very accessible to users. After all, like archives themselves, public records laws exist not just to preserve records, but to make them publicly available for use. If we are just storing this documentation, we are not in keeping with the spirit of the law. This problem is particularly obvious when researchers use permanent public records like course bulletins, where our run of physical copies on our reading room shelves begins at 1845 and ends at 2014.


Archived course descriptions

Figure 1: Online Course Descriptions from 2013, a permanent record preserved with Archive-It.


Live course descriptions

Figure 2: The same URL on the live web. Course Descriptions are only up for about a year and a half. For recent years we only have these records in WARC files.


So how do we make something as complex as web archives accessible at scale while only committing minimal resources? First, we integrate them with our traditional collections in the current format-neutral access systems that we are already committed to maintaining. Right now, this is our shiny new access system for EAD finding aids that will be public in the next few weeks at http://library.albany.edu/archive. The long-term goal of this isn’t to throw everything in finding aids, but more on that later. Secondly, we can automate this integration using a Python script and Archive-It’s CDX API.


Integration with Current Access Tools

To integrate the web archives that we’ve been collecting with our EAD finding aids, we first have to envision what that looks like.

At first glance, the natural point of record for web archives seems to be the seed. We have a bunch of archival collections that have one-to-one relationships with seeds. This includes collections like the Environmental Advocates of New York Records and the New York Civil Liberties Union Records, that have a single website captured using one primary seed each. However, most of our University Archives collections relate to a group of different crawls of the www.albany.edu seed. This is a many-to-one relationship, where one seed relates to many traditional collections. It wouldn’t be practical to create a separate seed for each office and department on campus – particularly when all of these pages link together in a structure that has meaning and context.

We have to iterate through each collection in order to make these connections with our collection records, and the simplest way to do this is by using a collection management system with an open API. We are between collection management systems right now. We are still using an old custom database for managing accessions, but this offers very little functionality and no simple way to export or work with the data. We’re planning on moving to ArchivesSpace this fall, but we’re not sure how long this implementation will take and we want to make our collections available now.

So, currently our master collection management system is a spreadsheet (not recommended). We settled on this temporary solution because we needed a data source that non-technical archivists could edit and that we could use to automate the generation of static pages for our new navigation system. Since we are already looping through all of our collections to make the static pages, it was simple to add some code to check for captures in our Archive-It collections and add them to our EAD files. All we needed to do was add three more fields to each collection record: the URL of the related website, the Archive-It collection number that would have included the page, and the series number where the web archives components will be listed.
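As a rough illustration of the loop described above, here is a minimal sketch that reads collection records exported as CSV and keeps only the ones with all three web archives fields filled in. The column names (`collection_id`, `web_url`, `archiveit_collection`, `web_series`) and the sample rows are hypothetical stand-ins, not the actual layout of our spreadsheet.

```python
import csv
import io

# Made-up rows standing in for a CSV export of the spreadsheet.
# The column names and collection IDs are hypothetical.
SAMPLE_CSV = """collection_id,web_url,archiveit_collection,web_series
example001,http://www.albany.edu/history/course-descriptions.shtml,3308,1
example002,,,
"""

WEB_FIELDS = ("web_url", "archiveit_collection", "web_series")


def web_archive_records(csv_text):
    """Return only the rows that have all three web archives fields filled in."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if all(row[field].strip() for field in WEB_FIELDS)]


for record in web_archive_records(SAMPLE_CSV):
    print(record["collection_id"], record["web_url"], record["archiveit_collection"])
```

Collections without a related website are simply skipped, so the same loop can serve both the static pages and the web archives check.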


Collection management spreadsheet

Figure 3: Our Collection Management spreadsheet



Now that we have a simple table of our collections with basic web archives data, we can ask Archive-It’s CDX API to see how many captures of these pages are available, get extents and date ranges, and enter them into our EAD files.

The Internet Archive’s CDX server is used by the Wayback Machine to show what captures are available – it’s what populates the calendar pages for each archived URL. It’s a very open and accessible example of an API and a great introduction to how to use these tools to get data from the Internet. There is some documentation here: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

All you do is use a simple URL to ask the server what captures are available for a URL. The format is just http://web.archive.org/cdx/search/cdx?url= plus the URL of the page you want to check for. So, for that course description page (http://www.albany.edu/history/course-descriptions.shtml) all you do is use: http://web.archive.org/cdx/search/cdx?url=http://www.albany.edu/history/course-descriptions.shtml. You can try this by just pasting this URL into any browser to see the results. Here is what it should look like:


Wayback CDX API

Figure 4: Results from the Wayback CDX API


Each line here is a different capture that is available in the Wayback Machine, so we can count the lines to see how many captures there are. There is also some other useful information here, including a timestamp for each capture, a hash, the MIME type and some other details.
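To make the line-per-capture idea concrete, here is a small sketch that builds the query URL and splits a response into fields. The field order is assumed to be the default listed in the CDX server documentation, and the sample response is made up for illustration (the hashes are placeholders), so no network request is made.

```python
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx?url="

# Assumed default field order, taken from the CDX server documentation.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]

# Made-up response text; real digests and lengths will differ.
SAMPLE_RESPONSE = (
    "edu,albany)/history/course-descriptions.shtml 20130425120000 "
    "http://www.albany.edu/history/course-descriptions.shtml text/html 200 "
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 5000\n"
    "edu,albany)/history/course-descriptions.shtml 20140910083000 "
    "http://www.albany.edu/history/course-descriptions.shtml text/html 200 "
    "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 5100\n"
)


def cdx_query_url(page_url):
    """Build the query URL: the endpoint plus the page we want to check."""
    return CDX_ENDPOINT + page_url


def parse_cdx(text):
    """Turn each space-delimited line into a dict, skipping unexpected lines."""
    captures = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(CDX_FIELDS):
            continue  # blank line or a layout this sketch does not expect
        captures.append(dict(zip(CDX_FIELDS, parts)))
    return captures


captures = parse_cdx(SAMPLE_RESPONSE)
print(cdx_query_url("http://www.albany.edu/history/course-descriptions.shtml"))
print(len(captures), "captures; first timestamp:", captures[0]["timestamp"])
```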

There is also another Archive-It API where you can ask for all the captures of a URL in an Archive-It collection. The format for this is: http://wayback.archive-it.org/[Archive-It Collection Number]/timemap/cdx?url=. Since the collection number for our University website crawls is 3308, the URL we use here is just: http://wayback.archive-it.org/3308/timemap/cdx?url=http://www.albany.edu/history/course-descriptions.shtml
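Since the only parts that change are the collection number and the page URL, the Archive-It request can be assembled with a one-line helper. This is just a sketch of the URL pattern given above; the function name is made up for illustration.

```python
def archive_it_cdx_url(collection_number, page_url):
    """Fill the Archive-It collection number and page URL into the pattern above."""
    return "http://wayback.archive-it.org/{}/timemap/cdx?url={}".format(
        collection_number, page_url
    )


print(archive_it_cdx_url(3308, "http://www.albany.edu/history/course-descriptions.shtml"))
```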


Archive-It CDX API

Figure 5: Archive-It CDX Results


Seeing this data in a web browser is nice, but we can also manipulate all of this information automatically with a simple Python script using the requests library.


Python script


Here we just take the URL request we want to make as a string and get the response in another string. Then we can use Python’s string methods to iterate through all the captures and sort out what we need. Once we count the number of captures and use the timestamps to get DACS-friendly date ranges, we can find the matching EAD file and insert this data to update or create new components using the lxml library. There is a good overview of lxml here: http://archival-integration.blogspot.com/2015/10/tools-for-programming-archivist-ead.html
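The counting and date-range steps might look something like the sketch below. It is not the script shown above: it assumes each response line is space-delimited with a 14-digit timestamp in the second field, and the year-range display format and function names are illustrative choices. The sample text is made up, and `fetch_cdx` is only defined, not called, so nothing here touches the network.

```python
from datetime import datetime


def fetch_cdx(url):
    """Request the URL and return the response body as one string."""
    import requests  # third-party; imported here so the rest runs without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def capture_timestamps(cdx_text):
    """Collect the timestamp (assumed to be the second field) from each line."""
    stamps = []
    for line in cdx_text.splitlines():
        fields = line.split()
        if len(fields) > 1 and len(fields[1]) == 14 and fields[1].isdigit():
            stamps.append(datetime.strptime(fields[1], "%Y%m%d%H%M%S"))
    return stamps


def summarize(cdx_text):
    """Return an extent statement plus display and normalized date ranges."""
    stamps = sorted(capture_timestamps(cdx_text))
    if not stamps:
        return None
    first, last = stamps[0], stamps[-1]
    if first.year == last.year:
        display = str(first.year)
    else:
        display = "{}-{}".format(first.year, last.year)
    unit = "capture" if len(stamps) == 1 else "captures"
    return {
        "extent": "{} {}".format(len(stamps), unit),
        "date_display": display,
        "date_normal": "{}/{}".format(first.strftime("%Y-%m-%d"),
                                      last.strftime("%Y-%m-%d")),
    }


# Made-up response text with only the fields this sketch relies on.
SAMPLE_TEXT = (
    "edu,albany)/history/course-descriptions.shtml 20160601093000 "
    "http://www.albany.edu/history/course-descriptions.shtml\n"
    "edu,albany)/history/course-descriptions.shtml 20130425120000 "
    "http://www.albany.edu/history/course-descriptions.shtml\n"
)

print(summarize(SAMPLE_TEXT))
```

Keeping a normalized range alongside the display date means the same summary can fill both the visible date expression and a machine-readable attribute in the finding aid.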

I put some sample scripts up on our GitHub page here to get people started: https://github.com/UAlbanyArchives/staticPages-webArchives

As a result, we have a bunch of links to access our web archives collections in relevant finding aids, presented with other materials that originated with the same creator. The extents and date ranges update automatically every week. You can see what this looks like for the University Council Records here: http://meg.library.albany.edu:8080/archive/view?docId=ua100.xml

You can directly access the Web Archives series here: http://meg.library.albany.edu:8080/archive/view?docId=ua100.xml#nam_ua100-1
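For readers who want a feel for the EAD side, here is a hedged sketch of adding a component to a Web Archives series. It uses `xml.etree.ElementTree` from the standard library rather than lxml so that it runs without extra dependencies. The element layout, the `href` link form, and the tiny sample finding aid are all assumptions for illustration, not a copy of our actual files.

```python
import xml.etree.ElementTree as ET

# A tiny made-up finding aid; the series id mirrors the anchor in the link above.
SAMPLE_EAD = """<ead><archdesc level="collection"><dsc>
<c01 id="nam_ua100-1" level="series"><did><unittitle>Web Archives</unittitle></did></c01>
</dsc></archdesc></ead>"""


def web_archive_component(title, date_display, date_normal, extent, link):
    """Build a file-level component describing one archived page (assumed layout)."""
    component = ET.Element("c02", {"level": "file"})
    did = ET.SubElement(component, "did")
    ET.SubElement(did, "unittitle").text = title
    unitdate = ET.SubElement(did, "unitdate", {"normal": date_normal})
    unitdate.text = date_display
    physdesc = ET.SubElement(did, "physdesc")
    ET.SubElement(physdesc, "extent").text = extent
    ET.SubElement(did, "dao", {"href": link})
    return component


def add_to_series(ead_root, series_id, component):
    """Append the component to the series with the given id, if it exists."""
    for series in ead_root.iter("c01"):
        if series.get("id") == series_id:
            series.append(component)
            return True
    return False


root = ET.fromstring(SAMPLE_EAD)
new_component = web_archive_component(
    "Course Descriptions",
    "2013-2016",
    "2013-04-25/2016-06-01",
    "12 captures",
    # Hypothetical link form; check it against your own Archive-It collection.
    "http://wayback.archive-it.org/3308/*/http://www.albany.edu/history/course-descriptions.shtml",
)
add_to_series(root, "nam_ua100-1", new_component)
print(ET.tostring(root, encoding="unicode"))
```

A real update would also need to find and replace an existing component rather than append a duplicate each week, which this sketch leaves out.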


Future Developments

Using these scripts, we can automatically insert new records for web archives into our EAD finding aids at scale. This provides minimal access to our web archives collections within the context of our traditional collections and through the same methods of discovery. All of this can be done without any manual effort going forward. Our collection pages are automatically updated as we make new crawls from within Archive-It.

Yet, the value here is the overall approach rather than the method. Obviously, relying on spreadsheets for holding even basic collection data is not sustainable. Automatically editing collection data in EAD files based on data you get over the Internet is also not a great tactic. The only reason we are able to do this now is that our EAD files are extremely consistent, they have all been created relatively recently, and they are governed by strict rule-based validation. We still had to go through a months-long project to clean up our legacy data. In most real-world cases, automating description with EAD like this is not labor-efficient and often unsustainable.

The longer-term solution is to automate updates to description using a system governed by a fundamental data model and accessible through an open API. ArchivesSpace could be an effective solution to this problem. Instead of XML and spreadsheets, our script can use the ASpace API to check for web archives resources, get the basic information we need, ask the Archive-It API what captures are available, and post that information back into ASpace. Ideally, we could also get information on how web pages were captured if Archive-It can expose data on seeds, crawls, and scoping rules in the future. In an ideal world, this information could be posted back into ASpace as machine-readable provenance data. What this means is that we could edit resource records in ASpace to include some basic web archives information, and that resource would be automatically populated with date range and extent information and assigned to a digital object with links to archived web pages hosted locally or with Archive-It. This ASpace data would then be automatically updated as we continue to crawl the web, and continually exported to a public access system where users could get fairly quick access to the pages captured by our crawls.


ArchivesSpace diagram

Figure 6: A rough, top-level data flow diagram of what a theoretical ArchivesSpace implementation could look like.


Finally, although the methods we used were helpful for creating many digital object links to web archives in our collections all at once, this came at the cost of ignoring much of the theory surrounding archival description. “Web Archives” series do not comply with DACS, as we conflated the intellectual content of web pages with the format of web archives. Figuring out how content from web archives should be described is a fundamental problem that we have avoided for now. This problem has both easy fixes and wider conflicts, but once there is a consensus on a path forward, the technical barriers between web archives and archival description will soon fall.