Better together: Collaborative web archiving at IIPC

June 28th, 2016

by Karl-Rainer Blumenthal

This is the first in a series of blog posts on the topic of collaborative collecting projects within the Archive-It community. Many of our partners, be they independent organizations or consortia, find it beneficial to share resources, seed lists, policies, or other tools of the web archiving trade. We’ll highlight just a few examples of collaborative collection building and will continue the conversation with all partners interested in contributing to group efforts.


To get the ball rolling, I spoke with Alex Thurman, the Web Resources Collection Coordinator at Columbia University Libraries. With the Library of Congress’s Abbie Grotke, Alex co-chairs the Content Development Working Group of the International Internet Preservation Consortium (IIPC). IIPC is the international membership organization that brings together national libraries, universities, and other web archiving organizations in order to set collective goals and priorities. Internet Archive is a founding member of IIPC and currently serves on its Steering Committee and in the Vice Chair role. You may know IIPC by its work on resources like the WARC file format standard and OpenWayback, or for its annual General Assemblies. Recently, however, it has also marshalled the combined resources and expertise of its members in order to build an archive of internationally relevant collections.


Karl-Rainer Blumenthal: IIPC represents a lot of institutions with diverse interests in archiving the web. What benefit(s) did it see in joining forces to build web archives together, in addition to its members’ many separate collecting efforts?

Alex Thurman: The IIPC’s collaborative collecting task force, now Content Development Group, identified several key benefits, but the most important were:

  • Wider access to the archived content
  • Larger and more diverse global collections
  • Quicker collecting in response to developing world events
  • The value of the ensuing collections for promoting web archiving through outreach to general users, researchers, institutional policy-makers, and web content producers

The IIPC had already seen the value of multi-institutional, multi-national collecting in its previous ad hoc collaborative efforts to archive content related to the Olympics in 2010, 2012 and 2014. But even these jointly created collections, crawled pro bono for IIPC by one of its members (the Internet Archive), were not yet publicly accessible. Also, when individual members created collections on other subjects or events of global significance, their crawl scope and/or ability to make the archived content public was sometimes limited by their particular national legal deposit environment. Increased access to content was therefore perhaps the largest factor in IIPC’s decision to build selective joint collections with a web archiving service.


IIPC Archive-It account


KB: There’s no shortage of internationally important subjects that this body could archive. What has it decided to focus its resources on collecting so far, and why? Are there any guidelines for what makes a good collecting topic, or for how the group identifies specific content to archive?

AT: The Content Development Group established some basic criteria for new collaborative topical or event-based collections. Proposed collections should be:

  • Of high interest to IIPC members
  • Broader than any one member’s responsibility or mandate
  • Of higher value for research due to broader perspective provided by multiple institutions’ participation
  • Transnational in scope

Given these guidelines, any Content Development Group member (CDG is open to all IIPC members) can propose ideas for new collections. If there is sufficient interest and a volunteer to lead curation, then we proceed with identifying seeds.

The three main collection topics we’ve pursued in our first year of collecting were World War I Centenary Commemoration, the European Refugee Crisis, and International Cooperation Organizations.




The World War I Commemoration collection arose organically as several IIPC members (including the Bibliothèque nationale de France, the Library of Congress, the British Library, Library and Archives Canada, and the National Library of Australia) were each already independently building a web archive devoted to the 100th anniversary of World War I, focused on content from their own country. By building a joint IIPC collection, providing access to new captures of all of their chosen seeds as well as new seeds contributed by other members, we’ve been able to create a much more well-rounded resource for researchers and other users. It’s one that we hope to continue building through the conclusion of WWI’s anniversary in 2018.


European Refugee Crisis


In the summer and fall of 2015, the news was dominated by the story of the rising numbers of refugees from the civil wars in Syria, Iraq and Libya arriving in Europe. The topic was proposed, and by October we had 500 seeds submitted by 11 member institutions. Additional seeds were later added and the whole collection was re-crawled earlier this spring.

The International Cooperation Organizations collection supports the IIPC’s mission of promoting “global exchange and international relations” by collecting the web presence of intergovernmental or other organizations whose international scope places them outside the normal collecting mandate of individual national libraries or even university libraries. We began by attempting to comprehensively crawl all identified seeds from the small but important top-level domain “.int” (for “international”), which includes the websites of such organizations as NATO, the Organization of American States, the African Union, Interpol, the World Health Organization, the World Trade Organization, and about 120 others.

Right now we’re gearing up for a new collection on the Rio 2016 Summer Olympics and Paralympics (for which we will consider seed nominations from the public) and planning a collection of news websites from around the world.


Olympics and Paralympics


KB: Was it a foregone conclusion to use Archive-It for collaborative collecting, or were there some specific technical and/or administrative features that led the group to that selection?

AT: The precursor group to the Content Development Group did a study of available web archiving service vendors, evaluated nine tools/services with further close consideration of three contenders, and ultimately recommended Archive-It. The decisive factors were Archive-It’s more developed functionality for descriptive metadata, its use of common platforms (Heritrix for crawling and Wayback for replay) and our assessment that though Archive-It’s up-front costs were higher, they were ultimately more sustainable because they were all-inclusive of harvesting, indexing, access, storage and technical support.


KB: It’s been about a year since IIPC launched its first Archive-It crawl. How have the benefits of collaborative collecting lined up to your expectations so far? And the challenges? Any sage advice that you can share with other institutions that may want to begin archiving together as a team?

AT: I’ve been pleased at the number of institutions participating by contributing seeds, and I think that the resulting collections are developing nicely! It’s exciting to work with colleagues from all over the world. Settling on precise metadata needs and implementing them for each collection has been challenging, though. Naturally everyone is busy building their own institutions’ collections, and this collaboration is a volunteer effort. And because we are spread out across the world and many time zones, we rely on email rather than conference calls and are able to meet in person only during the annual IIPC General Assembly meetings. So, in terms of advice, I’d say be sure to define the shared collection development goals and the various roles/functions that the team will need to fill in advance, and then dive in!