The stack: A guide to A/V web archiving with youtube-dl

January 28th, 2021

by Karl-Rainer Blumenthal, Web Archivist for Archive-It

Want to know more about a tool in our web archiving toolbox? Your suggestions or questions for future posts about Archive-It technology are very welcome here.

youtube-dl logo

Archiving web-based video and audio content can challenge even the most experienced web archivist. Streaming media platforms and their technologies change often and drastically, introducing new efficiencies or features for the live web user (and headaches for the collector). To keep pace with change, preservationists need an extensible and broadly accessible tool for collecting, organizing, and representing time-based media in its original context.

All of the Internet Archive’s web archiving partners have used one such tool since the February 2020 release of Archive-It 7.0. Below, I’ll describe briefly how to recognize, use, and troubleshoot this Swiss Army knife for web-based media.

What is it and why do I care?

youtube-dl is an open-source command-line utility for retrieving the media from streaming sites and services (like, but not limited to, its namesake YouTube). It is maintained by volunteers around the world, who document and develop it here.

It’s an important, popular, and widely developed tool because 1) it provides an archival utility to sites and services that do not provide one natively, and 2) it enables preservationists specifically to preserve full, discrete files (e.g., MP4, MP3, or MKV) for access. The original hosts may only provide the media as visitor-unique, segmented chains (such as MP2T in the case of video) that are reconstructed live in their proprietary players and are of little use on their own to repositories and end users.

Archive-It partners use youtube-dl to identify, retrieve, store, and even to replay the video and audio contents of their web crawls. youtube-dl runs on each web page during the crawling process and deposits the media items that it can find on each of these pages into WARC files with corresponding JSON metadata. The Archive-It Wayback interface can then reference the metadata file in order to match replayed web pages with the videos that should appear on them.

When youtube-dl runs as expected, Wayback can load information from its corresponding JSON metadata files into the banner message on each archived page, specifying how many media items were archived, and even provide direct access to them via an overlaid lightbox player:

Screen recording of Archive-It Wayback video replay

The Wayback banner message enumerates the archived audio and video on the page in two ways: 

Close-up on the Archive-It Wayback banner message with media information highlighted

The first number in the series (“# out of…”) reflects how many archived items loaded successfully in the current view.

The second number (“…out of #”) represents how many media items were actually collected and stored to WARC files when this web page was archived.

A disparity between these numbers (like 1 out of 2, 6 out of 10, etc.) indicates an error somewhere between storage and replay, in either Archive-It’s Wayback software or the user’s web browser. We’ll cover how to respond to possible errors after a quick peek under the hood.
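As a rough illustration of that comparison, here is a minimal Python sketch that pulls the two numbers out of a banner message and flags a disparity. The exact banner wording is an assumption on my part (only the "# out of #" shape comes from the description above), and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed wording: the banner contains a phrase shaped like "1 out of 2".
BANNER_COUNTS = re.compile(r"(\d+)\s+out\s+of\s+(\d+)")


def parse_banner_counts(banner_text: str) -> Optional[Tuple[int, int]]:
    """Return (loaded, collected) from banner text, or None if absent."""
    match = BANNER_COUNTS.search(banner_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_replay_gap(banner_text: str) -> bool:
    """True when fewer items loaded than were collected at crawl time."""
    counts = parse_banner_counts(banner_text)
    if counts is None:
        return False
    loaded, collected = counts
    return loaded < collected
```

A banner reading "6 out of 10" would be flagged, while "2 out of 2" would not.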

What it looks like up-close

You can install youtube-dl on your own machine to see how the tool operates behind the scenes of Archive-It partners’ web crawls. Once installed, running youtube-dl from the command line is as simple as: youtube-dl [URL for the webpage with desired media]. Add the optional --write-info-json flag (two hyphens) to also write the corresponding JSON metadata file for each request.

This example will collect the embedded video and metadata from the web page above:

$ youtube-dl --write-info-json [URL for the webpage with desired media]
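To get a feel for what those metadata files contain, the following Python sketch lists a few fields from every .info.json file in a folder. It assumes you ran the command yourself in that folder; the function name is hypothetical, and the keys it reads (webpage_url, title, ext, duration) are ones youtube-dl commonly writes but that may be missing for some sites, so the sketch tolerates their absence.

```python
import json
from pathlib import Path
from typing import Dict, List


def summarize_info_json(directory: str) -> List[Dict[str, object]]:
    """Summarize every *.info.json file found in `directory`."""
    summaries = []
    for path in sorted(Path(directory).glob("*.info.json")):
        with path.open(encoding="utf-8") as handle:
            info = json.load(handle)
        summaries.append({
            "file": path.name,
            # .get() returns None when a site did not supply the field.
            "page": info.get("webpage_url"),
            "title": info.get("title"),
            "format": info.get("ext"),
            "seconds": info.get("duration"),
        })
    return summaries
```

Calling summarize_info_json(".") after a local run returns one entry per metadata file, which makes it easy to see which page each media item came from.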

Any failure of youtube-dl to collect the intended media, like the audio file on this page, should be logged with an error note/description:

Screen capture of a youtube-dl command that produces an error message

These failures occur most frequently when a service or site is not compliant with youtube-dl, meaning that some element, structure, or behavior of the page prevents youtube-dl from working as intended, such as in the screenshot above. Maintainers keep an active but incomplete list of sites and services that are known to be compliant with youtube-dl here.
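If you want to test several pages at once, a dry run can be scripted. The Python sketch below is one possible approach, not how Archive-It itself invokes the tool: it calls youtube-dl with its --simulate option (which resolves media without downloading it) and collects any lines that begin with "ERROR:". The function names and the swappable runner parameter are my own, added so that the logic can be exercised without network access.

```python
import subprocess
from typing import Callable, List, Tuple


def build_simulate_command(url: str) -> List[str]:
    # --simulate asks youtube-dl to resolve the media without downloading it.
    return ["youtube-dl", "--simulate", url]


def check_page(url: str, runner: Callable = subprocess.run) -> Tuple[bool, List[str]]:
    """Return (ok, error_lines) for a dry run of youtube-dl against `url`."""
    result = runner(build_simulate_command(url), capture_output=True, text=True)
    # Assumption: failures are reported on stderr in lines starting "ERROR:".
    errors = [line for line in result.stderr.splitlines()
              if line.startswith("ERROR:")]
    return result.returncode == 0 and not errors, errors
```

A page that youtube-dl cannot handle should come back as not ok, along with the error text to include in any report to the maintainers.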

A QA workflow for youtube-dl

A/V will continue to test new limits of the web and present new challenges to web archivists accordingly. Knowing how to confirm that your A/V archiving utility performed properly in these cases is key.

Diagnosing the causes of any obstructions or failures to collect A/V materials is a practice of evaluating how youtube-dl ran and what it can and cannot accomplish with a given seed or web page. When you evaluate youtube-dl’s performance, ask yourself:

1. Did it run?

youtube-dl must run on a web page in order to later replay its media contents in Wayback mode.

To determine if youtube-dl ran as intended, first look to the Wayback banner message; does it report a total number of media items on the page that is greater than zero? If so, then youtube-dl did indeed run and there is a corresponding JSON metadata record for the page. If not (“0 out of 0”), then youtube-dl did not run or could not find any media to archive.

2. On which pages exactly?

Whether youtube-dl ran on a given page can also be confirmed from an Archive-It partner’s post-crawl Brozzler reports.

In replay mode, Wayback is looking for a 1:1 relationship between a web page’s URL and the youtube-dl metadata record for that URL. Because these metadata records are stored to WARCs like other contents, you can filter them from the report’s Hosts or File Types lists like so:

Screen capture of an Archive-It crawl's Hosts report, filtered for youtube-dl

If the page in question appears among the New/Docs list/s here, then youtube-dl did run as intended and either did not find or could not collect any media items.

3. Can we do better?

If youtube-dl shows no sign of having run on any page with the desired media content, check to make sure that it can and should run on the page.

First and foremost, find out if the given site or page is youtube-dl compliant by running youtube-dl locally against the seed URL or page URL in question. If the source is not compliant, then only a fix upstream to youtube-dl by its external maintainers or by the site’s owners to their own code may help from here. 

For example, this page with a YouTube embed is not youtube-dl compliant:

Screenshot of an archived webpage missing an embedded video

However, Archive-It can collect the same video from a page that is compliant, like its original YouTube watch page:

Screenshot of an archived video playing on an archived YouTube page
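The three questions above can be summarized as a small decision helper. This Python sketch is only my reading of the workflow, with hypothetical names and labels: it takes the banner's collected total, whether a youtube-dl metadata record for the page appears in the crawl report, and (if you have tested it) whether youtube-dl works on the URL locally.

```python
from typing import Optional


def triage(collected_total: int,
           record_in_report: bool,
           local_run_ok: Optional[bool] = None) -> str:
    """Map the three QA questions onto a next step (labels are illustrative)."""
    if collected_total > 0:
        # Question 1: the banner reports collected media, so youtube-dl ran.
        return "ran: media collected"
    if record_in_report:
        # Question 2: a metadata record exists, but no media came with it.
        return "ran: no media found or collected"
    if local_run_ok is None:
        # Question 3: no sign of a run yet; test the URL locally first.
        return "unknown: test the URL locally"
    if local_run_ok:
        return "works locally: report to Archive-It"
    return "not compliant: needs an upstream fix"
```

For the non-compliant embed page in the example, triage(0, False, False) lands on the upstream-fix outcome, which is the cue to try a compliant page for the same video instead.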

And more help is always available to Archive-It partners! 

Let us know if audio or video fall through the cracks. We can help you to confirm what is or isn’t archive-able with current technologies. Partner reports help us to keep our own technology up-to-date with the latest developments in the streaming service arms race, and to escalate issues to youtube-dl when need be. Your collecting is always the best tool to keep web archiving technologies vital and effective.