This is how we do it: University of Innsbruck

October 30th, 2013

The following is a guest post from Archive-It Partner, Renate Giacomuzzi of the University of Innsbruck. The University of Innsbruck’s Newspaper Archive co-hosted our first International Web Archiving Meeting on September 20th, 2013. As part of the program Renate Giacomuzzi, Armin Schleicher, and Elisabeth Sporer presented on their work with Archive-It. In addition to harvesting and managing their collections within the web application, their group created unique web portals as well as technology solutions to extract and index archived content to meet the unique needs of their collections and patrons.


The Innsbruck Newspaper Archive at the University of Innsbruck is an archive and research center for literary criticism and the dissemination of literature. The term literary criticism in German does not refer to academic criticism, but to criticism in newspapers, that is, book reviews and similar pieces. The nucleus of our collection therefore consists of book and theater reviews, collected since the late sixties. Today the archive provides access to over one million digitized articles and a publicly searchable database.


As an archive for literary criticism, we could not ignore the growing number of literary magazines, book reviews, and author homepages on the Internet. Therefore, we had to find a way to collect and republish web resources. In 2007, we started the pilot project DILIMAG – a collection of over 200 ‘born online’ literary magazines. With the help of our partner, the Department for Digitization of the University Library, we were able to build a database and archive a total of 147 magazines. For over 70 of these magazines, we obtained permission to provide either public or restricted access to the archived versions within our collection.


To archive websites for DILIMAG, we started out using the open source Web Curator Tool to perform our harvests. After this first web archiving project ended, we were fortunately able to acquire new funding for a project aiming to create an archive of author homepages. We decided to strengthen our cooperation with other web archiving institutions and contacted the Archive-It team, as the Internet Archive is widely known as the most experienced institution in the field of web archiving in the world. The cooperation with Archive-It is very helpful because we rely on personal technical support quite frequently, especially regarding problems with certain crawls. In addition, the good reputation of the Internet Archive leads to cooperation with other archives and, of course, with the rights holders of the web-based content we were aiming to harvest.

Inspired by the very intuitive collection page of Archive-It, we created a custom website for each of our collections: DILIMAG (going online soon), which contains around seventy digital literature magazines, and Author Homepages, which contains roughly a hundred author websites. For both projects, we also collect relevant metadata such as the date of first publication, the name of the website owner, types of content, and more. We also include a short description so that patrons can get an impression of the page at a glance, even without looking into the actual archive.


It is always important to make sure that people are aware they are viewing archived content, so we display a notification banner on every page within our collection.  However, we also encourage users to navigate to the live version of the websites as well to bring attention to the original content creators.  In addition, we implemented features such as full-text and metadata search and options for sorting and filtering through materials in the collection. To import the harvests into our local repository, we created an application that automatically copies, indexes and filters the archive files and updates our database accordingly.
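The post does not describe the internals of that import application, but its filter-and-index step can be sketched in broad strokes. The sketch below is purely illustrative, assuming harvested records are available as (URL, MIME type, timestamp) tuples; a real pipeline would read them out of WARC files with a library such as warcio, and the table layout is hypothetical.

```python
import sqlite3

# Hypothetical records as they might come out of a harvest
# (URL, MIME type, capture timestamp); a real importer would read
# these from the archive files themselves.
records = [
    ("http://example-magazine.at/issue/12", "text/html", "20130920"),
    ("http://example-magazine.at/style.css", "text/css", "20130920"),
    ("http://example-magazine.at/review/7", "text/html", "20130921"),
]

def index_records(records, conn):
    """Filter harvested records and index the page captures in a database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, captured TEXT)"
    )
    for url, mime, ts in records:
        if mime == "text/html":  # keep only page captures, skip assets
            conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, ts))
    conn.commit()

conn = sqlite3.connect(":memory:")
index_records(records, conn)
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)  # 2 HTML captures indexed
```

The same pass that filters records is a natural place to update the collection database, since every capture is touched exactly once.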

The complexity of the harvesting process is directly related to the structure of the content that needs to be archived. For the DILIMAG project, we had to put much more effort into crawler configuration and testing than was necessary for the author homepages. This is because the magazine sites were feature-rich, highly dynamic websites holding large amounts of content, while authors tend to create very clean, well-structured, and easily accessible static pages. Some authors use blog software to publish and administer their personal homepage, which makes it even easier to have a general, reusable approach for harvesting. The most interesting parts of author homepages are materials like video and audio documents, which can be challenging to integrate into our archive.
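The contrast can be illustrated with a toy scoping rule. A shallow, host-bound scope often suffices for a static homepage, while a dynamic magazine site needs extra rules to avoid crawler traps; the function and patterns below are illustrative assumptions, not Archive-It configuration.

```python
from urllib.parse import urlparse

def in_scope(url, seed_host, max_depth=3):
    """Toy scope check: stay on the seed host and within a path depth.
    Dynamic sites often need extra rejection rules (e.g. for endless
    calendar or session URLs); static homepages rarely do."""
    parts = urlparse(url)
    if parts.netloc != seed_host:
        return False  # off-host links are out of scope
    depth = len([p for p in parts.path.split("/") if p])
    if depth > max_depth:
        return False  # guard against deep, generated URL spaces
    if "sessionid=" in parts.query:
        return False  # a common crawler-trap pattern on dynamic sites
    return True

print(in_scope("http://author.example.org/texts/novel.html", "author.example.org"))       # True
print(in_scope("http://author.example.org/cal?sessionid=abc123", "author.example.org"))   # False
```

Real crawlers express these rules as seed scopes and regular-expression constraints, but the trade-off is the same: the more dynamic the site, the more such rules it takes to get a clean capture.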

Areas we are currently very interested in further investigating are the Facebook pages and Twitter feeds of authors. We feel that adding such content to the archive can greatly improve our collection, as authors use these channels to communicate with their fan base and often provide a very intimate perspective on their day-to-day life and work. Due to legal issues, and also because of the way such social networks are built, we have chosen to extract very select (curated) content using a parser instead of the crawler, and not to recreate the full user experience, including the pages’ look and feel.
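A parser of this kind keeps only the fields worth preserving and discards the surrounding interface. The sketch below assumes a hypothetical JSON feed; real platform APIs use different endpoints and field names, and access to them is governed by the platforms' terms of service.

```python
import json

# Hypothetical feed data standing in for a social network API response;
# field names here are invented for illustration.
raw_feed = json.dumps([
    {"id": "1", "created_at": "2013-09-20",
     "text": "Reading tonight in Innsbruck.",
     "user": {"name": "Example Author"}, "layout": {"theme": "dark"}},
    {"id": "2", "created_at": "2013-09-22",
     "text": "New chapter draft finished.",
     "user": {"name": "Example Author"}, "layout": {"theme": "dark"}},
])

def extract_posts(feed_json):
    """Keep only curated fields worth archiving; drop look-and-feel data."""
    posts = []
    for item in json.loads(feed_json):
        posts.append({
            "author": item["user"]["name"],
            "date": item["created_at"],
            "text": item["text"],
        })
    return posts

archived = extract_posts(raw_feed)
print(len(archived))  # 2 curated posts, stripped of layout information
```

Extracting structured text this way sidesteps the rendering problems a crawler faces on script-heavy social pages, at the cost of giving up the original look and feel.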

The web is becoming ever more important for the dissemination of literature. Publishing houses, magazines, authors, and readers are taking advantage of the medium as a communication and advertising tool but, as we have noticed, need to take better care to preserve these materials for future generations. This task must be taken on by non-profit institutions, archives, and libraries, if they take seriously their responsibility to save the present for the future.