Archive-It upgrades to version 7.0

February 10th, 2020

by the Archive-It team

We are pleased to announce the release of Archive-It 7.0 to our community of web archiving partners. This release includes several critical upgrades to web crawling, archival replay, and data reporting tools, and lays the foundation for further developments on the roadmap. Read the full Archive-It 7.0 Release Notes for details and documentation on each new feature.

For version 7.0, Archive-It engineers developed Crawlboss, a back-end manager for all of the information about web crawls and their configurations. All Archive-It partners will see the benefits in their crawl reports, which now include more information about seeds and their crawl settings. The same technical and provenance information can also be retrieved and repurposed through the Archive-It Partner API. Paired with Archive-It 7.0’s introduction of seed-specific data de-duplication for all web crawls, these enhancements make the sites that constitute a partner’s collection more discrete and portable: easier to manage, move, and preserve individually.

Screenshot of new seed information available in Archive-It 7.0 crawl reports

Crawl reports now include seed types and settings
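
For readers curious about what seed-specific de-duplication means in practice, here is a minimal sketch in Python. It is an illustration only, not Archive-It's or Crawlboss's implementation: the class name, the method name, and the choice of SHA-1 payload digests are all assumptions made for the example. The idea it shows is that duplicate detection is scoped to a single seed, so each seed's captures stay complete on their own.

```python
import hashlib
from collections import defaultdict


class SeedScopedDeduplicator:
    """Hypothetical illustration: remembers payload digests separately per seed."""

    def __init__(self):
        # seed -> {payload digest: URL where that payload was first stored}
        self._seen = defaultdict(dict)

    def record(self, seed, url, payload):
        """Return ("response", url) for new content within this seed, or
        ("revisit", original_url) if this seed already stored the same payload."""
        digest = hashlib.sha1(payload).hexdigest()
        seen_for_seed = self._seen[seed]
        if digest in seen_for_seed:
            return ("revisit", seen_for_seed[digest])
        seen_for_seed[digest] = url
        return ("response", url)


dedup = SeedScopedDeduplicator()
first = dedup.record("seed-a", "http://example.org/1", b"same bytes")
repeat = dedup.record("seed-a", "http://example.org/2", b"same bytes")
other_seed = dedup.record("seed-b", "http://example.org/1", b"same bytes")
```

In this sketch the second capture under `seed-a` is treated as a duplicate, while the identical payload under `seed-b` is stored in full, because no digest is shared across seeds. That independence is what makes a single seed's data easier to move or preserve by itself.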

Crawlboss also introduces the audio/video download utility youtube-dl to all web crawls. Previously exclusive to Archive-It’s browser-based web capture tool Brozzler, youtube-dl now also enhances the traditional Heritrix web crawler’s ability to archive challenging audio and video elements. Leveraging the additional A/V data and metadata enabled by this upgrade, Archive-It Wayback can retrieve more time-based media from the archives for front-end access.

Screenshot of new A/V replay tool in Archive-It Wayback

Archive-It Wayback now includes a lightbox viewer for time-based media

Meanwhile, to improve access to the many formats of web data that partners collect, Archive-It 7.0 uses OutbackCDX, a widely supported engine for indexing the contents of web archive collections. OutbackCDX generates and updates these indexes faster and with more automation than Archive-It’s legacy server could, meaning less wait time between collecting and sharing.
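
To give a rough sense of what a capture index does, the following Python sketch keeps a tiny in-memory index and looks up the capture closest to a requested date. It is a simplified illustration under stated assumptions, not OutbackCDX's code or API: the class, its methods, and the record fields (URL, 14-digit timestamp, WARC file name, byte offset) are hypothetical names chosen for the example.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamps, e.g. 20200210120000


class TinyCaptureIndex:
    """Hypothetical illustration of a capture index: URL -> list of captures."""

    def __init__(self):
        # url -> list of (timestamp, warc_file, offset) tuples
        self._captures = defaultdict(list)

    def add(self, url, timestamp, warc_file, offset):
        """Register where one capture of a URL is stored."""
        self._captures[url].append((timestamp, warc_file, offset))

    def closest(self, url, timestamp):
        """Return the capture of `url` nearest in time to `timestamp`, or None."""
        captures = self._captures.get(url)
        if not captures:
            return None
        wanted = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        return min(
            captures,
            key=lambda c: abs(datetime.strptime(c[0], TIMESTAMP_FORMAT) - wanted),
        )


index = TinyCaptureIndex()
index.add("example.org/", "20200101000000", "crawl-a.warc.gz", 0)
index.add("example.org/", "20200210120000", "crawl-b.warc.gz", 512)
nearest = index.closest("example.org/", "20200209000000")
```

A replay tool would use a result like `nearest` to find the file and byte offset holding the archived response. The sooner new captures are added to such an index, the sooner they become available for replay.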

Together, Crawlboss and OutbackCDX lay the foundation for even more upgrades to Archive-It’s web archiving stack, including high partner priorities such as more direct access controls over web captures, options for moving captures among different collections, a Python-based Wayback replay tool, and ever more opportunities to integrate with other software systems. Watch this space and the Archive-It development roadmap for these updates and more like them.