William Shakespeare: Playwright, Icon, Web Archivist?

December 4th, 2013

Guest post by Jaime McCurry

Jaime McCurry is currently serving as a 2013 National Digital Stewardship Resident at the Folger Shakespeare Library in partnership with the Library of Congress and the Institute of Museum and Library Services. She is working to establish local routines and best practices for archiving and preserving born-digital content created and collected by the Folger Shakespeare Library.

As a 2013 National Digital Stewardship Resident at the Folger Shakespeare Library, I am a newcomer to an existing web archiving initiative. The Folger has been archiving select websites related to William Shakespeare for about three years and I have been on the project for three months. I’ve had the chance, with a fresh set of eyes, to review local web archiving documentation, workflows, and collections, and advise on areas of improvement. The position is part of a wider effort to implement digital preservation practices across the organization.

It’s been such an amazing opportunity to work with web archive collections and digital preservation planning, both modern facets of a very traditional field, all the while surrounded by the history that is William Shakespeare and his works. One of the great things about being placed at the Folger and working with their web archive collections has been the opportunity to get to know Shakespeare a little better. When thinking of Shakespeare’s works, common and powerful themes come to mind. Love. Power. Mortality. It’s well known that the Bard didn’t shy away from the hard-hitting subjects of life.

What’s not well known, however, is that most of his more popular words of wisdom can be directly applied to web archiving. In fact, they can be applied so well that if you told me William Shakespeare was involved in web archiving during his lifetime, I might be inclined to believe you!

Here are some direct quotes:

“To thine own self be true.” – Hamlet

Stay true to your collection scope. When adding seeds to an existing collection, it is important to keep your original parameters in mind. If you realize that the URLs under consideration are drifting away from your original intent, it might be time to refine your scope or create a new collection entirely. The same advice applies when defining parameters for new collections: the scope of your collection must also fit your project and your institutional mission on a broader level, so communication with other stakeholders at your organization is imperative. For instance, at the Folger, we have a Collection Development Committee that works in partnership with the web archive project manager to consider and approve new proposals.

“Give every man thy ear but few thy voice.” – Hamlet

Listen to many potential seed URLs, speak to few. Review as many as you need for potential inclusion in your collection, but choose only the ones that you are sure will enhance your collection and be useful to your users. A well-curated, refined, meaningful collection is more valuable to your project and your users, even if this means there are only a few seeds at first.

When researching additional seeds for inclusion in our Shakespeare Festivals and Theatrical Companies collection, I compiled a list of over 200 URLs. After reviewing each one looking for certain qualities (live on the web, fit our collection scope, written in the English language, de-duped from existing seeds, etc.), 64 seeds were chosen for inclusion. Some of the criteria might seem obvious, but it’s necessary to decide what’s important to your project and compare each seed against these requirements. Have a solid understanding of what fits and what doesn’t and choose accordingly.
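For anyone facing a similar list, the de-duplication step is the easiest one to hand to a script. Here is a minimal Python sketch, assuming the candidate and existing seeds are plain lists of URL strings. The function names and the normalization rules (ignore the scheme, a leading "www." and a trailing slash) are my own illustrative assumptions, not part of Archive-It or the Folger workflow. Whether a site is live, in scope, and in English still needs a human reviewer.

```python
from urllib.parse import urlsplit


def normalize_seed(url: str) -> str:
    """Build a comparison key for a seed URL (hypothetical rules).

    The key ignores the scheme, a leading "www." and a trailing slash,
    so "http://www.example.org/festival/" and
    "https://example.org/festival" are treated as the same seed.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def select_new_seeds(candidates, existing_seeds):
    """Return candidates that duplicate neither an existing seed nor each other."""
    seen = {normalize_seed(url) for url in existing_seeds}
    selected = []
    for url in candidates:
        key = normalize_seed(url)
        if key and key not in seen:
            seen.add(key)
            selected.append(url)  # keep the URL exactly as it was submitted
    return selected
```

Called with the current seed list and the candidate list, select_new_seeds returns only the candidates that still need a manual look, in their original order.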

“God has given you one face and you make yourself another.” – Hamlet

Quality Assurance! Websites change. Oftentimes, URLs are no longer live on the web, or their domains have been sold and the content is no longer relevant to your collection. In such cases, these seeds need to be made inactive for future crawls. Interact with your archived material on a regular basis. Make sure that the content you are capturing is what you intended to collect.

“Hell is empty and all the devils are here.” – The Tempest

Speaking of Quality Assurance: QA reports, I’m looking right at you on this one! This is certainly how I felt the first time I laid eyes on a comprehensive QA report. It doesn’t have to be this way though. Read through the Archive-It help documentation to better understand and get the most out of your QA reports. Attend Archive-It advanced training webinars. Ask questions of your partner specialists. When I first came into this position, my predecessor had already created local documentation on QA workflows and the key points of a QA report. This was immensely helpful to me, and I now make a point of keeping this information as up to date as possible.

“To do a great right, do a little wrong.” – The Merchant of Venice

Shakespeare can only be referring to one thing here: robots.txt files. Odds are that you will be faced with a website that has blocked your archiving efforts with one of these. You do have the option to ignore these files on a host-by-host basis, and the Archive-It application allows you to do so. Taking a stance on robots.txt files deserves proper consideration. Discuss how you will handle them with your project team and with stakeholders at your institution. Document your reasoning. Fair use is one argument for ignoring them, but keep in mind that sometimes robots.txt files are placed there for your benefit. The portions of sites that they are blocking might not be useful to you in any way, and the domain owner might be doing you a favor. So no matter what you choose to do, make an informed decision.
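To make that decision an informed one, it helps to see exactly which pages a robots.txt file would keep a crawler away from. The short Python sketch below uses the standard library’s urllib.robotparser to do that. It assumes you have already saved the text of the site’s robots.txt, and the function name and the "example-archiver" user agent are placeholders of mine, not anything Archive-It defines.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls):
    """Return the URLs that the given robots.txt rules disallow for user_agent.

    robots_txt is the raw text of the file, so nothing is fetched here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up rules and URLs, for illustration only.
sample_rules = "User-agent: *\nDisallow: /admin/\n"
sample_urls = [
    "http://example.org/plays/hamlet",
    "http://example.org/admin/login",
]
blocked = blocked_urls(sample_rules, "example-archiver", sample_urls)
```

If everything in the blocked list turns out to be login pages or admin screens, the domain owner may well be doing you that favor.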

And, finally, perhaps most importantly:

“Better three hours too soon than a minute too late.” – The Merry Wives of Windsor

If there is a website that interests you for your collections, archive it as soon as possible! Web content can and frequently will disappear. What you see on the web now is not guaranteed to be there in 1, 10, or 20 years. The website Mr. William Shakespeare and the Internet provides a perfect example of this. At one point, it was a content-rich resource dedicated to Shakespeare on the web. It would have been a nice option for capture through our project. But it has since been retired, and although it is accessible through Archive.org’s Wayback Machine, it is unfortunately no longer eligible for our efforts. If you’ve got your eye on a seed, plant it fast (see what I did there?). Shakespeare would approve.

Contact Information:
Jaime McCurry
National Digital Stewardship Resident
Folger Shakespeare Library
Twitter: @jaime_ann