Not All Websites Are Made Equal (Or Friendly): Archiving ephemeral art content on the web

April 26th, 2013

The New York Art Resources Consortium (NYARC), in conjunction with Truman Technologies, recently published a series of web archiving reports, including an in-depth case study on using Archive-It to develop a web archiving program that captures organizational and artist websites. These findings, including consultant recommendations for Archive-It, are available to read online here.

In addition to these reports, an article by Heather Slania in the Spring 2013 issue of Art Documentation, “Online Art Ephemera Web Archiving at the National Museum of Women in the Arts” (JSTOR), takes a closer look at the challenges of current web archiving technology from the perspective of a small arts organization. It follows the National Museum of Women in the Arts (NMWA), based in Washington, D.C., in its attempt to build a thriving web archiving program focused on capturing ephemeral art content on the web.


National Museum of Women in the Arts, http://www.nmwa.org/about

Slania’s article relates specifically to the quality assurance work in web archiving – the process of evaluating archived materials post-crawl to determine the completeness of the captures and to troubleshoot portions of sites that do not archive well on the first attempt. One strength of the article is the detail it shares about the web archiving process, including the challenges of archiving websites that are not archive-friendly, a term we might use to describe a site that archives well with minimal effort using current web archiving technology like Archive-It.
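Slania does not reduce that quality assurance process to code, but to make the idea concrete, here is a minimal sketch of one way to spot-check a capture: collect the asset URLs referenced by a live page, then ask the Wayback Machine’s public CDX index whether each one has been captured. The CDX endpoint, the seed URL, and the whole approach are illustrative assumptions, not Archive-It’s QA workflow – partners normally review captures directly in the Wayback interface.

```python
# Illustrative QA spot check (a sketch, not an Archive-It tool): list the
# assets a live page references and check the Wayback CDX index for captures.
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, quote
from urllib.request import urlopen


class AssetCollector(HTMLParser):
    """Collect the URLs of images, scripts, stylesheets, and links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.add(urljoin(self.base_url, attrs["src"]))
        if tag in ("link", "a") and attrs.get("href"):
            self.assets.add(urljoin(self.base_url, attrs["href"]))


def has_capture(url):
    """Return True if the Wayback CDX index reports at least one capture of url."""
    cdx = ("http://web.archive.org/cdx/search/cdx?output=json&limit=1&url="
           + quote(url, safe=""))
    with urlopen(cdx, timeout=30) as resp:
        body = resp.read().decode("utf-8", errors="replace").strip()
    if not body:
        return False
    rows = json.loads(body)
    return len(rows) > 1  # the first row of a JSON result is the header


if __name__ == "__main__":
    page = "http://www.lynnhershman.com/"  # any seed URL from the collection
    with urlopen(page, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    collector = AssetCollector(page)
    collector.feed(html)
    for asset in sorted(collector.assets):
        status = "captured" if has_capture(asset) else "MISSING"
        print(f"{status:>8}  {asset}")
```

A report like this only flags missing URLs; judging whether an archived page actually looks and behaves like the live one still takes a person clicking through the capture.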

Consider, for example, the website of visual artist Lynn Hershman, which was selected to be archived as part of NMWA’s Contemporary Women Artists on the Web collection. Slania spotlights Hershman’s site because it was not completely archived in the initial efforts to capture the content, likely due to dynamic content (in this case Flash, which was a major component of the site). After these initial captures, and after the article was written, Archive-It upgraded to Heritrix 3.1, a newer and more robust version of the web crawler that is better able to handle complex dynamic content and might have captured Hershman’s homepage more fully before its redesign.

But then, something changed. Starting in March 2013, all of the information on the site, including images and the overall styling, was captured completely. The archived version replicated the experience a user would have had viewing the live site. What changed? In this case, the site itself. The new redesign left behind proprietary plugins like Flash; instead, it was built and styled primarily with CSS/JavaScript and the WordPress CMS. This more archive-friendly site can now be captured as part of NMWA’s collecting mission.

1. Earliest version of http://www.lynnhershman.com/ archived by the Internet Archive’s Archive.org Global Wayback Machine, May 20th, 2000.

2. Incomplete capture of a previous redesign of Lynn Hershman’s site, cited in “Online Art Ephemera Web Archiving at the National Museum of Women in the Arts,” from September 5th, 2012.

3. Most recent redesign of http://www.lynnhershman.com/, captured on March 5th, 2013.

Not all websites are made equal. After all, the web is a hodgepodge of technologies, some old and outdated, others at the cutting edge. By outward appearance a website may seem normal and fully functioning, but its status as a media object is only temporary. It may work because of a specific browser technology, proprietary software, or other variables. Removed from its current location on a particular web server, its ability to stand the test of time is far less certain. This ephemerality is both the challenge and the motivation for web archivists working to preserve the web for future generations.

As Slania notes in her article, “the use of dynamic content and cutting-edge web technologies will be a problem that will plague art web archivists as long as there are web developers and artists attempting to create new and groundbreaking platforms.” Looking towards the future, there is unlikely to be one catch-all technology that will archive the web in its many forms as it inevitably changes. A variety of capture mechanisms, including but not exclusively web crawlers, may work together to give us a complete and authentic capture of a website.

There is still plenty of work to do to get us to that point. In the meantime, archivists and even webmasters are considering how web design can facilitate the archival process, making websites friendlier to archiving technology. Slania is correct to suggest that greater awareness among web developers of how much history is lost online might advance the cause of webmasters building “the concept of archiving into the new technologies.”

In general, we might consider the guidelines below when determining whether a website will archive well with minimal effort using current web archiving technology.

An archive-friendly website:

  • Shares many of the same qualities as search-engine-optimized (SEO) websites – those that are ready to be indexed by web crawlers.
  • Gives every unique page a unique URL that is directly linked from another page on the same host.
  • Has a logical site structure reflected in semantic URLs, which lets web archivists target specific portions of a site and makes it easier to analyze archived content by URL.
  • Uses a robots.txt file that permits archival crawlers and does not block files containing important styling information (see the sketch after this list).
  • Provides sitemaps as an alternative way for crawlers to discover content that may sit behind forms, search boxes, or other database-driven tools available to visitors on the live web.
  • Serves little or no content via proprietary services that require users to download additional browser software; such content is often, though not always, difficult to play back in archived form.
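As a rough companion to the robots.txt and sitemap points above, the sketch below uses Python’s standard library to check whether a site’s robots.txt permits an archival crawler, whether it blocks a few common styling paths, and whether a sitemap is advertised. The “archive.org_bot” user-agent token and the sample paths are assumptions for illustration, and RobotFileParser.site_maps() requires Python 3.8 or later; this is not an official Archive-It tool, just one way to eyeball the guidelines.

```python
# Quick archive-friendliness spot check (illustrative sketch only).
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def check_archive_friendliness(site, crawler_agent="archive.org_bot"):
    """Print a few quick signals of how archive-friendly a site's robots.txt is."""
    rp = RobotFileParser()
    rp.set_url(urljoin(site, "/robots.txt"))
    rp.read()  # a missing robots.txt is treated as allow-all

    # 1. Is the crawler allowed to fetch the homepage at all?
    print(f"{crawler_agent} may crawl {site}: {rp.can_fetch(crawler_agent, site)}")

    # 2. Are styling/script paths blocked? Blocked CSS and JavaScript make
    #    archived pages render poorly even when the HTML itself is captured.
    for path in ("/css/", "/js/", "/wp-content/"):  # illustrative guesses, not known paths
        allowed = rp.can_fetch(crawler_agent, urljoin(site, path))
        print(f"  {path} allowed: {allowed}")

    # 3. Does robots.txt advertise a sitemap, or is one at the conventional location?
    sitemaps = rp.site_maps() or []  # Python 3.8+
    if not sitemaps:
        try:
            urlopen(urljoin(site, "/sitemap.xml"), timeout=30)
            sitemaps = [urljoin(site, "/sitemap.xml")]
        except URLError:
            pass
    print(f"  sitemaps found: {sitemaps if sitemaps else 'none'}")


if __name__ == "__main__":
    check_archive_friendliness("http://www.lynnhershman.com/")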

When Lynn Hershman’s website moved from the territory of archive-unfriendly to archive-friendly, NMWA was more easily able to use current web archiving technology to preserve and provide access to the content in perpetuity.

Ultimately, these guidelines will change as web technology changes and web archiving technology responds. Every day, Archive-It, along with the Internet Archive and its network of technologists, works to improve the ability of Heritrix to crawl the web and capture authentic, fully functioning copies of dynamic, modern websites. The input of our partners is invaluable to this process, and we encourage organizations like NMWA to develop case studies, experiment, analyze results, and share feedback with the Archive-It team.

What do you think of these guidelines? What’s missing?  What types of content are you trying to archive?  What challenges are you seeing? We would love to hear your feedback in the comments.