State of the WARC: Our Digital Preservation Survey Results

January 5th, 2016

by Jefferson Bailey & Maria LaCalle

We circulated a survey in June 2015 that asked Archive-It partners about their local digital preservation activities involving WARC files. With this survey we hoped to gain a better understanding of our partners’ current workflows, near-term planning, and long-term goals for the local digital preservation of WARC files. Fifty partner institutions responded, representing institutions with both large and small web archiving programs. With such an extensive and diverse set of responses, we wanted to share some of what we learned from the survey.

One of the main takeaways from the survey was that partners are generally not downloading their WARC files for local preservation. This mirrors findings from the NDSA 2013 Web Archiving in the United States Survey (PDF). The percentage of those locally preserving their WARCs was the same across both our survey and NDSA’s, with 20% of respondents reporting that they ingest their Archive-It WARC files into a local digital preservation system and/or long-term repository. A much higher number, however, 53%, have plans to download their WARCs for local preservation in the future. This was notably higher than in the somewhat older NDSA survey, in which only 23% of institutions reported plans to download data in the future. This suggests that, although further efforts will be necessary for the web archiving community to advance distributed digital preservation, the future looks promising: since the last NDSA survey (another NDSA survey is forthcoming), local preservation of archival web data has been receiving increased attention as a part of web archiving programs.

We also asked which systems were being used among those institutions that are using digital preservation systems to manage and preserve their WARC files. A variety of proprietary, open-source and homegrown systems were described in the responses, with the most prevalent being Archivematica (18%), followed by Fedora (15%), homegrown systems (15%), Islandora (9%), DuraCloud and DSpace (6%), and 25% of respondents not yet having a local preservation system for web archives.




While local digital preservation of WARC files may not be on the immediate horizon for many programs, especially newer programs still establishing workflows and policies, plans for future activities in this area appear to be in the works. Over 60 percent of respondents described local digital preservation initiatives under consideration at their institutions.

The creation of preservation metadata for WARC files is fairly uncommon, with little uniformity among those institutions that reported creating metadata. Some form of high-level metadata for WARC files was described by 14% of respondents. Several reported using Dublin Core (which is also Archive-It’s default descriptive metadata template), while one partner described a process of extracting data from WARCs and storing this as technical metadata.

Those interested in the topic of preservation metadata for WARC files may want to check out our presentation to the ALCTS PARS Preservation Metadata Interest Group at this summer’s ALA conference, “Don’t WARC Away: Preservation Metadata and Web Archives,” in which we outlined some of the challenges and possibilities surrounding preservation metadata and web archives.

Here at Archive-It we are also continuing to develop tools, processes, and partnerships with digital preservation service providers with the goal of facilitating more distributed and local preservation of WARCs by partners. In addition to existing tools like our WARC download portal and our partnership with DuraCloud, we are exploring new storage architectures, transfer utilities, and collaborations that expand digital preservation of web data. Our work as part of a recently awarded grant from IMLS will include building new tools such as export APIs, giving partners programmatic, configurable access for downloading archived web data from Archive-It for preservation in local systems.

Along with new tools, projects, and partnerships, we also plan to hold webinars and trainings in 2016 to help partners better understand the methods available to them to download and preserve their web data locally. We’ll also run this survey again next year in order to track how we, and the community, have progressed on distributed digital preservation for our web archives. More soon!