A New Wayback: Improving Web Archive Replay

September 8th, 2021

by the Archive-It team

The Internet Archive is excited to announce the preliminary release of a significant upgrade to the Wayback web archive replay software that our partners use to access and browse their web archive collections. The new version of Wayback is a complete rebuild of the prior version of the software used by both Archive-It and the many customized access portals that we build and host on behalf of our worldwide users.

The new Wayback represents a step forward in the quality and completeness of web archive replay. It makes future feature development easier and continues the legacy of Wayback software as the original, most widely used, and most actively maintained web archive replay tool since its original release in 2001.

The new Wayback being released across our web archiving services also better integrates with the version powering the Wayback Machine, which will enable easier sharing of replay fixes and collaborative development. We will be rolling out the new Wayback to Archive-It, our custom contract web archiving services, and our hosted access portals over the next few weeks. Look out for more news as we bring improved web archive replay to all our partners.

Screenshot: Wayback Machine access on archive.org in 2001

The “Wayback Machine” launched in 2001 as the first public view into the Internet Archive’s web archives. Archive-It’s own customized version of the original Wayback replay software has seen significant development over the years to meet user needs and to add new features and enhancements. In that time, the web has become a more dynamic and complex medium to preserve. In response, we built Brozzler, our browser-based web harvester, for improved archiving of dynamic, responsive, and media-rich websites; the new version of Wayback will facilitate a similar improvement in the rendering of archived web content.

The new Wayback was rewritten completely in Python, a more contemporary programming language, and includes the new Replay Rules Engine, a more extensible system that supports the innumerable replay enhancements originating from years of Archive-It partners’ quality assurance efforts and the Archive-It team’s technical replay fixes. The Replay Rules Engine will also allow for broader community contribution to a library of replay fixes that can be shared across systems. After the production release across our various web archiving services and platforms, and once any bugs are squashed, the new Wayback code will be made available under an open-source license on our GitHub page.
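For readers curious about what a rules-based approach to replay fixes can look like in general, here is a minimal illustrative sketch in Python. It is not the Replay Rules Engine itself, whose code has not yet been released; the `ReplayRule` class, the `apply_rules` function, the rule name, and the example.com URLs are all hypothetical names invented for this illustration. The idea shown is simply that each fix is a small, shareable unit: a URL pattern plus a function that rewrites the archived response body.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReplayRule:
    """Hypothetical rule: a URL pattern plus a fix for the archived body."""
    name: str
    url_pattern: str
    fix: Callable[[str], str]

    def matches(self, url: str) -> bool:
        # A rule applies when its pattern is found in the archived URL.
        return re.search(self.url_pattern, url) is not None


def apply_rules(url: str, body: str, rules: List[ReplayRule]) -> str:
    """Apply every matching rule, in list order, to the archived body."""
    for rule in rules:
        if rule.matches(url):
            body = rule.fix(body)
    return body


# Hypothetical example rule: remove a script statement that would send
# the browser to the live site instead of staying in the archive.
rules = [
    ReplayRule(
        name="strip-live-redirect",
        url_pattern=r"^https?://example\.com/",
        fix=lambda body: body.replace(
            "window.location = 'https://example.com/live';", ""
        ),
    ),
]
```

Because each rule in this sketch is plain data plus one function, a library of such rules could in principle be exchanged between systems, which is the kind of sharing described above. How the actual engine represents and applies its rules may differ entirely.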

Animation: a COVID dashboard preserved by the IIPC as it replays in the legacy (left) and new (right) Archive-It Wayback replay environments

Internet Archive staff and Archive-It partners collaborated on testing and refining the new Wayback, which passed a battery of quality control measures to achieve parity with the soon-to-be-mothballed Java-based version. Testers and partners documented the many areas of replay improvement from the new Wayback, especially its facility with modern content encoding, client-side URLs, and the replay of archived data dashboards, media players, and popular press sites.

For their generous commitments to beta testing, we are indebted and grateful to the web archiving partners at Columbia University Libraries, International Internet Preservation Consortium (IIPC), East Baton Rouge Parish Library, Gates Archive, Harvard University Archives, Library and Archives Canada, National Library of Medicine, New York Art Resources Consortium (NYARC), New York University, and the University of North Carolina at Chapel Hill Libraries. From the Internet Archive team, special shout-outs go to Software Engineer Barbara Miller and Web Archivist Karl-Rainer Blumenthal for their contributions and leadership in the project, and to Lead Software Engineer Kenji Nagahashi for original development for the Wayback Machine.

Screenshots: a live CNN.com news feed collected by the National Library of Medicine as it replays in the legacy (left) and new (right) Wayback replay environments

Ideally, the upgrade to the new Wayback will be anticlimactic to our many Archive-It and managed crawling partners — your archives will replay as expected and without disruption and, overall, users should see both improved replay fidelity and better performance. Moving forward, our goal is to reduce the volume and complexity of custom code interventions required by partners who collect highly interactive websites and to build on this new version of Wayback by adding new tools and features to make web archiving, and viewing web archives, easier for all.

Learn more

You can watch the recorded webinar below to learn much more about the release, including more example improvements from Archive-It partners and live discussion.