Just In Case: A conversation about Archive-It and DuraCloud

May 15th, 2015

Maria LaCalle, Web Archivist for Partner Services, interviewed Kristen Yarmey, Associate Professor and Digital Services Librarian at the University of Scranton, about the university's digital preservation strategy for web archives, and more specifically for the WARC files created and managed through Archive-It. The University of Scranton uses the Archive-It Sync Tool for DuraCloud, which provides seamless offsite backup and preservation of WARCs.

Maria LaCalle: Could you start off by telling us a little bit about your web archiving with Archive-It?

Kristen Yarmey: The University of Scranton Weinberg Memorial Library began working with Archive-It in 2012 (www.scranton.edu/library/webarchives). So far, our main focus has been on capturing and preserving the University’s web presence, including official University sites (like www.scranton.edu, admissions.scranton.edu, and athletics.scranton.edu), event-specific sites (like 125th.scranton.edu), and affiliated but external sites (like www.thescrantonplayers.com). As a result, our Archive-It collections to date are comparatively small. While we’ve crawled over 4 million URLs in the past 3 years, our archived data adds up to only 106 GB.

Recently, we’ve been dipping a toe into capturing social media sites (like the University’s main Facebook page and Twitter and Instagram accounts) as well as leveraging Archive-It for capturing news stories about our campus community. I’m also exploring the possibility of using Archive-It to support faculty research interests and to preserve web content that complements our physical special collections. With all of these projected use cases, I anticipate a significant increase in our archived data in the next few years.

ML: What were you doing for digital preservation of your WARC files prior to using Archive-It Sync?

KY: Nothing! When we were first considering partnership with Archive-It, the idea of requesting copies of our data from Archive-It was appealing, but we didn’t really have a place to put it. Signing on with DuraCloud in 2014 suddenly gave us new flexibility to accommodate large data sets in our digital preservation repository.

While I feel comfortable with Archive-It’s storage practices (with multiple copies at Internet Archive data centers), storing copies of our data in DuraCloud gives me an extra layer of security for the long-term preservation of our collections. Implementing this additional “just in case” backup also assuaged some concerns expressed by stakeholders at my institution about overreliance on a single service provider.

ML: Could you describe the workflow for ingesting WARC files? How much time does it take to set up and manage?

KY: What’s fantastic about this integration of Archive-It and DuraCloud is that it’s entirely automated. The Archive-It Sync tool simply synchronizes my WARC files to a designated space in our DuraCloud repository. It’s really just that easy. If I want to view or access the synced WARC files (the largest of which are chunked into smaller files), I can just log in and download them like any other materials in our DuraCloud repository.

Setting it up took less than 10 minutes: all I had to do was create an Archive-It standard user account and share the credentials with DuraCloud Support. It’s really helpful that neither Archive-It nor DuraCloud charges additional fees for the service, which makes pricing and billing very straightforward. Our 100 GB of preserved WARC files simply becomes part of our regular DuraCloud annual subscription plan, which is tiered by terabyte.
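For readers who download a chunked WARC and want a single file again, the reassembly step might look something like the minimal Python sketch below. It is an illustration under stated assumptions, not a description of how the Sync Tool or DuraCloud works internally: the `.dura-chunk-NNNN` file suffix, the function name `reassemble_chunks`, and the choice of MD5 for the checksum are all assumptions that should be checked against the files and checksums actually shown in your own DuraCloud space.

```python
import hashlib
from pathlib import Path


def reassemble_chunks(chunk_dir, base_name, output_path,
                      suffix_marker=".dura-chunk-", algorithm="md5"):
    """Join downloaded chunk files back into one WARC and return its checksum.

    Assumption (not confirmed by the interview): chunks sit in one directory
    and are named <base_name><suffix_marker><number>, e.g.
    crawl.warc.gz.dura-chunk-0000, crawl.warc.gz.dura-chunk-0001, ...
    """
    prefix = base_name + suffix_marker
    chunks = [p for p in Path(chunk_dir).iterdir() if p.name.startswith(prefix)]
    if not chunks:
        raise FileNotFoundError(f"no chunk files found for {base_name!r}")

    # Order chunks by the number that follows the marker, not by file time.
    chunks.sort(key=lambda p: int(p.name[len(prefix):]))

    digest = hashlib.new(algorithm)
    with open(output_path, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as src:
                # Copy in 1 MiB blocks so large WARCs never sit fully in memory.
                for block in iter(lambda: src.read(1024 * 1024), b""):
                    out.write(block)
                    digest.update(block)
    return digest.hexdigest()
```

Comparing the returned hex digest with the checksum recorded for the original, unchunked file is one way to confirm that nothing was lost or reordered during download and reassembly.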

ML: What types of institutions do you think this process would work for?

KY: This kind of “set it and forget it” model works really well for my institution. I don’t have much time to devote solely to web archiving (it’s only one small part of my job), and I’d like to focus the time that I do have on selecting seeds, scoping crawls, and adding metadata.

At a more philosophical level, I feel strongly about encouraging the use and development of open, non-proprietary software, systems, and formats. It’s important to me that both Archive-It and DuraCloud give my institution the opportunity to implement open source tools while still benefiting from excellent customer support and subscription services.

ML: Are there additional directions you’d like to see us move towards in terms of enhanced digital preservation?

KY: There have been some exciting discussions lately about new tools and strategies for visualizing and analyzing content captured in WARC files. In the future, it would be great to see resources like that available to users via Archive-It and/or DuraCloud. Someday, I’d also love to be able to better integrate our Archive-It collections with our other University Archives digital collections, such that users could seamlessly explore all the various types of born-digital content (from PDFs to videos to web pages) that we have preserved. I’m not entirely sure how that might work, but the vibrant creativity and meaningful collaboration I see in the web archiving community give me high hopes for the future.