2016 State of the WARC: Our Second Annual Digital Preservation Survey Results

November 29th, 2016

by Jefferson Bailey & Maria Praetzellis

Last year we launched an annual digital preservation survey in order to get a better sense of the Archive-It community’s practices, tools, and goals related to the local preservation and management of their web archive collections. You can see last year’s results in the original State of the WARC blog post. We have wrapped up the 2016 survey and can now report on this year’s results.

Summary findings:

  • The percentage of partners receiving their WARCs for local preservation and management fell slightly overall, to 13% from 2015’s 20%.
  • The number of respondents creating metadata for their WARC files also fell slightly, with 15% stating that they create this metadata, compared to 20% in 2015.
  • The landscape of tools and systems used locally for WARC files remained consistent, with a wide range of systems mentioned, suggesting interoperability will remain a key need for data transfer.
  • Related to our IMLS grant exploring API-based systems interoperability, we asked partners what WARC-specific data would best support a data transfer API to facilitate local ingest of WARC files. Collection and seed were the most popular elements, with crawl and date range also popular choices.

This year’s survey had 55 respondents, a 10% increase over last year’s 50 responses. Interestingly, 23 of those 55 (42%) had data budgets of 1 terabyte/year or higher, suggesting that larger programs had a greater tendency to respond to the survey’s questions regarding local web archive preservation efforts. This makes sense, as larger programs are more likely to have the resources and technology to preserve — and plan to preserve — their web archives; but it also suggests that there is fertile ground for working with many smaller programs to support custodial preservation activities for web archives.

While the overall number of responses to the survey was up slightly, the percentage of partners who are downloading their WARCs for local preservation dropped slightly from last year’s numbers. 13% of respondents to this year’s survey download their WARCs (or have them shipped on drives), 33% plan to do so soon, and 54% are not doing so. In 2015, 20% ingested their WARCs into local systems and 53% planned to do so soon. (Since “do you plan to…” was a separate question in 2015, instead of an option alongside yes/no, as in 2016, there is some minor variance between surveys.) This rough consistency in statistics is also reflected in the soon-to-be-released 2015 NDSA survey, which found that the number of institutions transferring files into a local repository remained around 20%, matching the level reported in their 2013 survey.

 

[Figure: responses to “Do you download your WARCs?”]

 

The number of respondents creating metadata for their WARC files also remained mostly unchanged, with 85% stating that they do not create metadata for WARC files, as compared to 80% last year. For those ingesting, managing, and storing their WARC files locally, the referenced systems were similar to last year’s survey and included ArchivesSpace, DuraCloud, APTrust, Hydra/Fedora, Archivematica, and, most commonly, undefined local systems. When asked what systems respondents would like to be able to use in the future, all services proved popular. With this in mind, interoperability appears to be key for Archive-It partners, as many have yet to develop local preservation requirements or are still in the stage of implementing preservation systems or incorporating web archive data into existing workflows.

Interoperability is one of our goals at Archive-It and guides our development of APIs for access to Archive-It data and WARC files, including the aforementioned data transfer APIs, as well as descriptive metadata APIs and enhancement of existing APIs around CDX, OpenSearch, and other data. This year’s survey included several questions designed to inform development of APIs that support the transfer of WARCs (and associated derivatives) for local preservation.

When asked to select all associated data or metadata that they would like to see accompany a requested batch of WARCs, partners favored descriptive metadata (85%), though all categories proved popular, including content indexes (80%), crawl configuration (75%), and crawl logs (61%). Again, this points to a desire for multiple APIs for programmatic access to multiple data domains.

In terms of how partners would prefer to identify which WARCs to download, respondents expressed interest in listing WARCs by collection (90%), seed (90%), specific date range (76%), crawl (69%), and URL (65%). Asked how APIs would improve the ability to integrate web archiving systems or tools, respondents had a range of interesting use cases. Many noted the potential for greater automation, systems integration, and overall ease of ingest. Interestingly, multiple respondents noted that a transfer API would facilitate their ability to provide downstream users and researchers with better access to their data in WARC or in derivative format, a use case explicitly built into the current draft general API specification.
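To make the selection criteria above concrete, here is a minimal sketch of how a local script might narrow a list of WARC files by collection, seed, crawl, and date range before requesting a transfer. It is an illustration only: the `WarcRecord` fields, the `select_warcs` function, and the sample values are all hypothetical and are not taken from the draft API specification or from any Archive-It system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class WarcRecord:
    """Hypothetical description of one WARC file (not an Archive-It schema)."""
    filename: str
    collection: str
    seed: str
    crawl: str
    crawl_date: date


def select_warcs(
    records: Iterable[WarcRecord],
    collection: Optional[str] = None,
    seed: Optional[str] = None,
    crawl: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[WarcRecord]:
    """Return the records that match every filter supplied.

    A filter left as None is ignored. The date range is inclusive
    at both ends.
    """
    selected = []
    for record in records:
        if collection is not None and record.collection != collection:
            continue
        if seed is not None and record.seed != seed:
            continue
        if crawl is not None and record.crawl != crawl:
            continue
        if start is not None and record.crawl_date < start:
            continue
        if end is not None and record.crawl_date > end:
            continue
        selected.append(record)
    return selected


# Made-up sample data, for illustration only.
RECORDS = [
    WarcRecord("a.warc.gz", "1234", "http://example.org/", "crawl-1", date(2016, 3, 1)),
    WarcRecord("b.warc.gz", "1234", "http://example.com/", "crawl-2", date(2016, 9, 15)),
    WarcRecord("c.warc.gz", "5678", "http://example.org/", "crawl-3", date(2016, 10, 2)),
]

if __name__ == "__main__":
    # WARCs from one collection, captured on or after June 1, 2016.
    for record in select_warcs(RECORDS, collection="1234", start=date(2016, 6, 1)):
        print(record.filename)
```

Combining filters this way mirrors the survey responses, where most partners wanted to select by collection or seed first and then narrow by date range or crawl.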

Over the next year, we plan to continue developing new tools and partnerships that support partners’ preservation needs for web archives. We’re looking forward to the 2017 State of the WARC!