The stack: An introduction to the WARC file
April 1st, 2021
by Karl-Rainer Blumenthal, Web Archivist for Archive-It
Want to know more about a tool in our web archiving toolbox? Your suggestions or questions for future posts about Archive-It technology are very welcome here.
It’s a digital preservation mantra: lots of copies keeps stuff safe (LOCKSS). And web archiving is a useful example — when sites change or disappear from the web, web archivists around the world have copies at the ready to maintain access to vital information resources. The foundation of this promise is the WARC file, a global standard for containing all of the data that we need in order to make web archives possible.
But what do we know about that WARC? It has been developed much more slowly than the technologies that collect and replay its contents. (After more than ten years, its specification is still at version 1.1.) This pace befits a long-term archival container for material that is exposed to such notoriously rapid change on the live web, so we might be excused for not thinking too much about the digital box into which we shelve all these materials.
But to preserve and manage them in the longest term, it helps to know what WARCs are and are not–what they can and cannot do for future users of web archives. For an introduction to this important digital preservation standard and a peek into its contents, watch the Archive-It Advanced Training series webinar or continue reading below:
What is a WARC file?
A WARC (Web ARChive) is a container file standard for storing web content in its original context, maintained by the International Internet Preservation Consortium (IIPC).
Let’s unpack what this means. A WARC is…
- a digital file that you can store on your own local or networked storage, like a PDF document or an MP3 audio file, complete with its own .warc file extension and application/warc MIME type.
- a container file that houses other files. It concatenates several files into one digital object, like you’ve seen elsewhere from container formats like ZIP, GZIP, TAR, or RAR. A WARC wraps around other files like the PDF and MP3 above, along with some additional information and formatting that we’ll cover below.
- a container for files that are native to the web. WARCs are produced by crawlers, proxies, and other utilities that retrieve files from a live web server. They can contain the PDF and MP3 files described above, for instance, but also the HTML, JS, CSS, and other structural elements that web browsers need to read in order to represent site contents to human computer users.
- a container that can also contextualize those contents. WARCs contain technical and provenance metadata about the collection and arrangement of their media so sites can be read and represented in live web browsing experiences like they were at the time of their collection.
- a standard container format. The WARC file format standard was published by the International Organization for Standardization (ISO) committee on technical interoperability as ISO 28500. You might get other outputs from web scraping tools, but WARC is the generally agreed-upon way to contain web archives such that people and their software know how to interpret and read the contents today and into the future.
- a standard maintained by web archivists. Keeping up the WARC file format standard is the responsibility of the International Internet Preservation Consortium (IIPC). This coalition of practitioners does the ‘agreeing upon’ above, which keeps the WARC relevant and vital to how we collect and preserve web archives.
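To make the "container" idea concrete, here is a minimal sketch in Python of how a single WARC record is laid out on disk: a version line, a handful of named header fields, a blank line, and then the contained file itself. The helper name build_record and the example.org URLs are made up for illustration, and the header list is pared down to a few common fields rather than everything a production tool would write.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"

def build_record(warc_type: str, target_uri: str, payload: bytes,
                 content_type: str = "application/octet-stream") -> bytes:
    """Serialize one record: version line, headers, blank line, content block."""
    headers = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.1" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs after the block end the record.
    return head + CRLF + payload + CRLF + CRLF

# A WARC file is records like these written one after another.
# The payloads below are placeholder bytes, not real PDF or MP3 data.
warc_bytes = (
    build_record("resource", "https://example.org/report.pdf", b"%PDF-1.4 placeholder", "application/pdf")
    + build_record("resource", "https://example.org/song.mp3", b"ID3 placeholder", "audio/mpeg")
)
print(warc_bytes.decode("utf-8"))
```

In practice these files are usually stored gzip-compressed (as .warc.gz), which this sketch leaves out.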
A (very!) brief history
The WARC was preceded by the ARC file format, which the Internet Archive used to contain its collected web archives as far back as 1996. If your organization has ever used the Waybackfill Service or if it started crawling with Archive-It before 2009, then you still have these files in your own collections to this day as well.
A capture from one of the first ARC files created by an Archive-It partner, the South Dakota State Archives and State Library. The original page is now offline.
The ARC file was the Internet Archive’s original container file for web-native resources, so it conformed to the first three bullet points in the definition above. Reflecting the needs of web archivists around the world to preserve more context about their collected resources, the WARC standard was formalized in 2009 to include the very detailed kinds of technical metadata that we’ll explore below.
Much specificity and readability were added to the WARC standard for its 2017 upgrade to version 1.1. Thanks to the IIPC and the National Library of France (BnF), you can also access it outside of the ISO paywall now. IIPC maintains a version-controlled copy for markup here on GitHub and BnF’s bibnum file format index houses PDF and DOC copies here.
The WARC file format has since been added to the UK National Archives’ PRONOM registry as fmt/289 and to the Library of Congress’s list of described formats for sustainability here.
A look inside
The WARC file includes metadata about its creation and contents, records of server requests and responses, and each server response’s full payload. In other words, the WARC file documents everything that was done in order to transfer information from a web server to its reader (like a web crawler or you at your browser). It includes the intended contents of that transfer too of course, but also some useful clues about how we can piece them back together later.
It does this in eight distinct pieces, each with its own meaning and metadata attributes. Each of these is called a WARC record. To get to know them, take any WARC from your collections or this sample file of an IIPC blog post, open it in your favorite text editor, and look for the following in the “WARC-Type” field, starting right at the top of the file:
- warcinfo: This record identifies the file as a WARC. It tells us a little bit about how and when it was created, who created it, and–in the case of Archive-It–the collection to which it belongs. It tells us precisely when this acquisition occurred, the software that was used to do so, and even the location and host machine that did the work, all of which is good provenance information for the future.
- request: Archive-It’s collecting tools must request each webpage, downloadable document, etc., from its original, live web server. This request starts with a metadata header, which includes information about the request, the requester, and how to deliver the relevant contents to them. Under the header we see the precise request as the server received it, so that it’s documented and preserved.
- response: The live server’s response to this request is then written into the WARC as well. Again, it begins with header metadata to contextualize it individually; the header tells us that it is a unique response to a request for a specific document at a specific time, using a specific communication method. And again, the header is followed by the original content of that delivery: the original file or code from the web that we might want to reproduce in a web browser.
With the above alone, we can use rendering software (like Wayback) to request a document from the WARC, to get this same response that was generated at collection time, and to read the same HTML or load the same image in a web browser.
You will, however, also find two additional record types in most WARCs created by Archive-It, which reflect some of the service’s helpful efficiencies:
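If you would rather script this than scroll through a text editor, the lookup described above can be sketched in a few lines of Python. This is a simplified illustration and not how Wayback itself works: it assumes an uncompressed WARC, one header field per line, and a tiny in-memory sample whose records carry only a few headers. The function names read_records and replay, the helper _record, and the example.org URL are all hypothetical.

```python
import io

def read_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if not line.strip():
            continue                    # skip the blank lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # a blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block

def replay(stream, uri):
    """Return the HTTP body of the first response record captured for `uri`."""
    for headers, block in read_records(stream):
        if headers.get("WARC-Type") == "response" and headers.get("WARC-Target-URI") == uri:
            # The block is the raw HTTP response: status line and headers,
            # a blank line, then the body that a browser would render.
            _http_head, _, body = block.partition(b"\r\n\r\n")
            return body
    return None

def _record(warc_type, uri, block):
    """Build a pared-down sample record (illustration only)."""
    head = (f"WARC/1.1\r\nWARC-Type: {warc_type}\r\nWARC-Target-URI: {uri}\r\n"
            f"Content-Length: {len(block)}\r\n\r\n").encode("utf-8")
    return head + block + b"\r\n\r\n"

sample = io.BytesIO(
    _record("request", "https://example.org/", b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n")
    + _record("response", "https://example.org/",
              b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>")
)
print(replay(sample, "https://example.org/"))  # b'<h1>Hello</h1>'
```

For real collections, an established WARC-reading library is a safer choice than a hand-rolled parser like this one.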
- revisit: This record describes the response to a request for material that has already been archived, which hasn’t changed, and which Archive-It subsequently de-duplicates. By matching known checksum values in a collection, our tools can instead write a reference to an existing response record and where to find it when necessary for replay.
- resource: This record is created by the web archiving process, to capture and describe material related to an archived resource, but which might not have a discrete URL of its own. Archive-It does this most often to capture two types of resources: the screenshots and thumbnails of web pages that Brozzler creates automatically for future reference; and the videos that are retrieved by youtube-dl instead of either Brozzler or the “standard” Archive-It crawling stack.
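The de-duplication behind revisit records can be sketched roughly as follows. This is an illustration under stated assumptions, not Archive-It's actual implementation: the function choose_record and the in-memory `seen` index are hypothetical, and the digest is written in the "sha1:" plus Base32 form that is common in WARC headers. The header names used are fields from the WARC 1.1 specification.

```python
import base64
import hashlib

def payload_digest(payload: bytes) -> str:
    """Return a labelled digest in the common 'sha1:<base32>' form."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

def choose_record(uri, date, payload, seen):
    """Return WARC headers for a new capture, de-duplicating against `seen`.

    `seen` maps a payload digest to the (uri, date) of its first capture. It is
    a hypothetical stand-in for whatever index a real crawler consults.
    """
    digest = payload_digest(payload)
    headers = {"WARC-Target-URI": uri, "WARC-Date": date, "WARC-Payload-Digest": digest}
    if digest in seen:
        # Unchanged content: point back at the earlier capture instead of
        # storing the same payload a second time.
        first_uri, first_date = seen[digest]
        headers.update({
            "WARC-Type": "revisit",
            "WARC-Profile": "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest",
            "WARC-Refers-To-Target-URI": first_uri,
            "WARC-Refers-To-Date": first_date,
        })
    else:
        seen[digest] = (uri, date)
        headers["WARC-Type"] = "response"
    return headers

seen = {}
first = choose_record("https://example.org/logo.png", "2021-01-01T00:00:00Z", b"image bytes", seen)
second = choose_record("https://example.org/logo.png", "2021-04-01T00:00:00Z", b"image bytes", seen)
print(first["WARC-Type"], second["WARC-Type"])  # response revisit
```

At replay time, the reference in the revisit record tells the rendering software which earlier response holds the actual content.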
These are the building blocks of any web archive created with Archive-It’s tools. However, the WARC specification also allows for two types of records that are not known to be implemented anywhere at the time of writing, but which speak to the management and preservation of web archives:
- conversion: This record holds space for the eventual migration of archived web materials into successor formats if and when that need arises. An HTML5 record could for instance appear here in order to augment or improve access to content that was collected in the deprecated Adobe Flash format.
- continuation: This record would enable rendering software to read and represent an archived document across two separate records if need be. It is based on the premise that the process of writing a record’s content into a WARC file could be interrupted, and that the process could therefore be ‘continued’ in a subsequent record, just picking right up where it left off at the next line.
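Since continuation records are not known to be implemented anywhere, the following is only a thought experiment about how rendering software might stitch segments back together. It leans on the segmentation fields named in the WARC specification (WARC-Segment-Number, WARC-Segment-Origin-ID, WARC-Segment-Total-Length); the function reassemble, the record IDs, and the sample data are all hypothetical.

```python
def reassemble(records, origin_id):
    """Join the block of record `origin_id` with its continuation segments."""
    segments = {}
    total = None
    for headers, block in records:
        is_origin = headers.get("WARC-Record-ID") == origin_id
        is_continuation = (headers.get("WARC-Type") == "continuation"
                           and headers.get("WARC-Segment-Origin-ID") == origin_id)
        if is_origin or is_continuation:
            segments[int(headers["WARC-Segment-Number"])] = block
            if "WARC-Segment-Total-Length" in headers:
                total = int(headers["WARC-Segment-Total-Length"])
    # Segments must be numbered 1, 2, 3, ... with no gaps.
    if not segments or sorted(segments) != list(range(1, len(segments) + 1)):
        raise ValueError("missing or misnumbered segment")
    data = b"".join(segments[n] for n in sorted(segments))
    if total is not None and len(data) != total:
        raise ValueError("reassembled length does not match the declared total")
    return data

part1, part2 = b"<html><bo", b"dy></body></html>"
records = [
    ({"WARC-Type": "response", "WARC-Record-ID": "<urn:uuid:a>",
      "WARC-Segment-Number": "1"}, part1),
    ({"WARC-Type": "continuation", "WARC-Record-ID": "<urn:uuid:b>",
      "WARC-Segment-Origin-ID": "<urn:uuid:a>", "WARC-Segment-Number": "2",
      "WARC-Segment-Total-Length": str(len(part1 + part2))}, part2),
]
print(reassemble(records, "<urn:uuid:a>"))  # b'<html><body></body></html>'
```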
And finally:
- metadata: Many WARC files, including Archive-It’s, include a list of records at the bottom that can further describe the contents of the above records, so that we can better understand why they were created or what they looked like at that time. They can provide the most basic record of what we call the “scope” of an Archive-It web crawl on a record-by-record basis. For example, a metadata record for an embedded resource like an image or video might describe how the collecting tool identified it as “in-scope” and subsequently archived it.
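A metadata record's content is typically a short list of "name: value" lines, which makes it easy to read programmatically. The sketch below parses such a block in Python. The field names via and hopsFromSeed are meant as illustrative examples of crawl-scope information and should be treated as assumptions; check your own WARCs for the fields that your crawler actually writes.

```python
def parse_warc_fields(block: bytes) -> dict:
    """Parse a block of 'name: value' lines into a dict (last value wins)."""
    fields = {}
    for line in block.decode("utf-8").splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            fields[name.strip()] = value.strip()
    return fields

# Hypothetical metadata for an embedded image: the page that linked to it,
# and a code describing the path the crawler took from the seed.
block = b"via: https://example.org/\r\nhopsFromSeed: E\r\n"
fields = parse_warc_fields(block)
print(fields["via"], fields["hopsFromSeed"])  # https://example.org/ E
```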
That’s the gist! But you can always read the IIPC’s extensive documentation for many more details and case studies of all of the above.
What’s next?
For many Archive-It partners, knowing that their holdings are contained and available in a standardized format is enough to feel confident about their futures. But the LOCKSS principle doesn’t end at web capture. Here at the Internet Archive we maintain multiple copies of all partners’ W/ARCs in case of any kind of data loss. Our Storage and Preservation Policy outlines how.
Still, many Archive-It partners download W/ARC files into local or third party storage for additional preservation and care. For a detailed example, check out partner Adriane Hanson’s great blog post about the University of Georgia’s process for their own safekeeping. Now that you know what’s in the box too, I hope that this introduction can help you to gauge your need for, or interest in, managing WARCs directly.
If you’ve read this far, then you’re already something of an advanced beginner when it comes to WARC files! Knowing what you know now, I’d be interested to hear how you would augment or improve the standard going forward. The WARC develops slowly, but it’s here to meet your web archiving needs.