Launching Archive-It Research Services (Part 1)

March 16th, 2015

by Jefferson

We are excited to announce the launch of Archive-It Research Services, a new Archive-It add-on service that we have been working on for the last few months! In this Part 1 blog post (expect Part 2 in the next few weeks), we want to give partners an introduction to the service, along with some context and background on why we have undertaken this initiative and how we think it will benefit partners and the broader community.

Archive-It Research Services — The Why

Since its launch in 2006, Archive-It has provided partners the tools to build archives and special collections of historically valuable and meaningful web content. To date, over 350 partners have created over 2800 collections totaling well over 10 billion web documents. These collections are browsable as they were the day they were captured, and full-text and faceted search allow for discovery of sites, pages, and documents within collections. This access model, however, remains oriented towards studying individual resources one at a time via searching, clicking, and browsing the archived web in the same way we interact with the live web.

Archive-It Research Services (ARS) aims to complement this method of access by providing new, data-oriented access models that allow for studying partner collections in aggregate and across time. By offering research datasets built from key metadata, provenance information, named entities, hyperlinks, and other elements of archived resources, ARS will enable the study of web archives using the data mining methodologies increasingly popular within the humanities, social and computer sciences, and other research communities. It will also enable patrons and researchers to use these derived datasets for local analysis and tool-building, and to combine them with other, external datasets.

The overall goals of Archive-It Research Services are to:

  • Increase use of partner collections by expanding how these collections can be accessed and analyzed by patrons, researchers, and scholars.
  • Facilitate new, data-driven forms of research, analysis, and digital humanities work using web archives to further demonstrate the value of partner web collections.
  • Allow institutions of any size access to collection-derived datasets whose creation requires the complex processing and substantial computing infrastructure that Archive-It and Internet Archive are ideally suited to provide.
  • Offer new datasets and access models to support innovation by the broader community in building new tools, interfaces, visualizations, and other outputs that can improve the creation, management, and use of web archives.

Emerging methods of data-driven research, such as studying network graphs, text and data mining, and large-scale, longitudinal content analysis, are increasingly common in many disciplines and are often applied to digitized non-web collections, but they have yet to take advantage of the voluminous data within curated web collections. Some notable and admirable exceptions exist, primarily using domain-level or global web crawls, but we are excited to see how these methods can leverage the curated web archives being built by librarians and archivists, and what will come of pairing this increasingly popular type of data analysis with the rich historical content in Archive-It partner collections.

Archive-It Research Services — The What

ARS will launch with three available datasets that each support a variety of research methods and data mining activities. In brief, the three preliminary ARS datasets will be:

WAT: Web Archive Transformation files feature key metadata elements that represent every crawled resource in a collection and are derived from a collection’s WARC files. This includes information such as provenance (IP address, capture timestamp, HTTP headers, etc.), key textual metadata (page title, metatags), outbound and embedded links, and more. WAT files are encoded in JSON for easy analysis and are 5% to 25% the size of their corresponding WARC files.

LGA: Longitudinal Graph Analysis files feature a complete list of what URLs link to what URLs, along with a timestamp, for an entire collection over time. This allows for network analysis of linking behaviors between all documents in a collection. LGA files are in a simple tab-separated format and are generally ~1% the size of an entire collection.
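To give a feel for the kind of network analysis a tab-separated link list makes possible, here is a minimal Python sketch. The column layout it assumes (capture timestamp, source URL, target URL, one link per line) and the sample rows are hypothetical and only for illustration; the real file layout is documented on the Archive-It Research Services wiki and may differ.

```python
from collections import Counter

# Hypothetical sample data. The three-column layout
# (timestamp, source URL, target URL) is an assumption for
# illustration, not the documented LGA format.
SAMPLE_LGA = [
    "20150101120000\thttp://example.org/\thttp://example.org/about",
    "20150101120000\thttp://example.org/\thttp://example.com/",
    "20150201120000\thttp://example.org/about\thttp://example.com/",
]


def count_inbound_links(lines):
    """Count how many link records point at each target URL."""
    inbound = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not match the assumed layout
        _timestamp, _source, target = fields
        inbound[target] += 1
    return inbound


inbound = count_inbound_links(SAMPLE_LGA)
for url, n in inbound.most_common():
    print(url, n)
```

Counting inbound links is only the simplest possible measure; the same loop could just as easily feed a graph library or group links by capture date to study change over time.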

WANE: Web Archive Named Entities files use named-entity recognition tools to generate a list of all the people, places, and organizations mentioned in each text document (including PDFs) in a collection along with the source URL and a timestamp of when the document was archived. WANE files have tab-separated document information and an encoded array of entity information and are less than 1% the size of their corresponding (W)ARC files.
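As a rough illustration of working with this kind of file, the Python sketch below tallies how often each person is mentioned across documents. It assumes, purely for the sake of example, that each line holds a timestamp, a source URL, and a JSON object of entities separated by tabs, and that the entity keys are named "persons", "locations", and "organizations". None of these details are specified in this post, so treat them as placeholders and check the wiki for the actual encoding.

```python
import json
from collections import Counter

# Hypothetical sample lines. Both the tab-separated layout and the
# JSON key names are assumptions for illustration only.
SAMPLE_WANE = [
    '20150101120000\thttp://example.org/a\t'
    '{"persons": ["Ada Lovelace"], "locations": ["London"], "organizations": []}',
    '20150201120000\thttp://example.org/b\t'
    '{"persons": ["Ada Lovelace"], "locations": [], "organizations": ["Royal Society"]}',
]


def tally_entities(lines, entity_type):
    """Count mentions of each entity of the given type across all lines."""
    counts = Counter()
    for line in lines:
        # Split on the first two tabs only, so the payload stays intact.
        _timestamp, _url, payload = line.rstrip("\n").split("\t", 2)
        entities = json.loads(payload)
        counts.update(entities.get(entity_type, []))
    return counts


person_counts = tally_entities(SAMPLE_WANE, "persons")
print(person_counts.most_common())
```

Because each record carries its own URL and timestamp, the same approach could be extended to track when and where a given name first appears in a collection.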

If this all seems a bit abstruse, worry not. We will be providing far more explication of each dataset, as well as example use cases and more service details, in Part 2 of this blog post. We have also created a full Archive-It Research Services wiki that covers all the details about the service, datasets, use cases, and more.

In the meantime, Archive-It users will notice a new friend in the menu bar of the web application, a “Research Services” link!

This link will take users to a set of pages (in the Archive-It 5.0 user interface) with additional information on the datasets and how to request the service.

We are excited to kick off this service (the first for curated web archives!) and will be giving related talks, running workshops, and writing more blog posts throughout 2015 as we promote it. We hope that expanding access and increasing researcher utility through ARS helps further demonstrate the value of, and increase the use of, the web archives that Archive-It partners work so hard to create, preserve, and maintain.

In Part 2 of this post, we will describe in more detail the available datasets and outline some potential use cases and research examples.