Understanding the “web” in web archives: What we preserve and why

August 31st, 2020

by Jillian Lohndorf, Web Archivist for Archive-It

When you think about the web, what do you imagine? Maybe it’s a flashback to your first interactions with the web? Or your first social media accounts? Or maybe the events of 2020 make you think of virtual school, searching for bread recipes, or too many Zoom meetings? 

But how often do you think of physical things?

While we often talk about the output (cat gifs!), or use metaphors (“the cloud”), we don’t often talk about the infrastructure of the web, or its big sister, the internet. But understanding the web (and the internet) is crucial to web archiving, and to the Archive-It approach in particular.

The internet is a network of computers, all working to move data from one place to another, often very far away. The internet itself is very physical: cables connecting servers; cable highways; shark-proofed cables under the ocean; data centers filled with servers; servers storing information; routers moving packets of information between servers; wifi using radio waves to move that information; satellites beaming information. All of these physical pieces work together to get data from one place to another. And people help, of course, maintaining each part and creating new technologies.

The world wide web sits on top of the internet. When information is sent over the web, it’s broken up into smaller chunks (called packets) to make it easier to move. Have you ever heard the phrase “ones and zeros”? Everything transported across the internet is in ones and zeros, also known as binary. The web makes this binary data readable for humans, and browsers make it accessible for end users. When you plug a URL (a webpage’s digital address) into your browser, software translates your request into binary and sends it over the physical hardware; the page’s data arrives at your digital door in packets, and your browser reassembles them. Web archiving works in a similar way, but with different technology: it captures all of these pieces of information, stores them on physical machines, and then reassembles them on demand. These three steps (capture, storage, and replay) are the foundation of every Archive-It collection.
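To make those three steps a little more concrete, here is a minimal sketch in Python. It is only an illustration, not Archive-It’s implementation: the URL is a placeholder, it uses the common `requests` library, and a plain file on disk stands in for real archival storage (Archive-It writes standardized WARC files and replays them through a wayback-style server).

```python
# A minimal capture-storage-replay sketch. Illustrative only:
# real web archives store WARC files, not loose HTML pages.
import requests

url = "https://example.com/"  # placeholder address

# 1. Capture: request the page; the packets arrive and are
#    reassembled into one stream of bytes (ones and zeros).
response = requests.get(url)

# 2. Storage: write those raw bytes onto a physical disk.
with open("capture.html", "wb") as f:
    f.write(response.content)

# 3. Replay: read the stored bytes back on demand, exactly as
#    they were captured, for a browser to render.
with open("capture.html", "rb") as f:
    stored_copy = f.read()

print(stored_copy[:60])  # the first few stored bytes
```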

Most people access the live web or a web archive through a browser, such as Firefox, Chrome, Internet Explorer, Safari, or Brave. Browsers are programs that move and store information, and they play an important role in how information is accessed. While browsers generally share the same goal, each goes about it just differently enough that the same webpage may display differently from one to the next. Browsers will also sometimes save little bits of information from websites to speed up how quickly a page loads; collectively these saved bits are called a cache, and they too can influence how a webpage looks.
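Those saved bits aren’t mysterious: the hints that tell a browser what it may keep, and for how long, travel as plain HTTP headers alongside every page. A tiny sketch in Python (using the `requests` library and a placeholder URL) can print two of the most common ones:

```python
# Peek at the caching hints a server sends alongside a page.
# Illustrative only; the URL is a placeholder.
import requests

response = requests.get("https://example.com/")

# How long (in seconds) a browser may reuse its saved copy
print("Cache-Control:", response.headers.get("Cache-Control"))

# A fingerprint of this version of the content; when it changes,
# the browser knows its cached copy is stale
print("ETag:", response.headers.get("ETag"))
```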

All of these browser behaviors impact web archives as well. Has a Web Archivist ever recommended that you clear your cache? That is usually because there’s a new piece of information to replay, but the browser is remembering the old one. The precise URL that you plug into a browser also plays a part in web archiving, as its construction helps direct the Archive-It crawling technologies to what should or should not be captured. Things like http versus https, www or no www, even the trailing slash (/) at the end of an address: each of these can affect what a web crawler or a patron of your archives sees, because each changes the very real scope and route of a data transfer.
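To see why those small differences matter, consider that to software each variant is a distinct address. This sketch uses only Python’s standard library, with example.com standing in for any site:

```python
# Four addresses a person might call "the same page"
from urllib.parse import urlparse

variants = [
    "http://example.com/page",
    "https://example.com/page",
    "https://www.example.com/page",
    "https://example.com/page/",
]

for url in variants:
    scheme, host, path = urlparse(url)[:3]
    print(f"{scheme:6} {host:16} {path}")

# Every row differs in scheme, host, or path, so a crawl scoped
# to one variant may never visit, or capture, the others.
```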

While parts of the internet and the web infrastructure have stayed much the same, other parts have changed significantly. Archive-It has as well. One particularly relevant example of this is the development of the Brozzler web capture technology. Brozzler is a web browser-based collecting tool, meaning that it monitors and archives the network connections made while loading a webpage. Unlike standard web crawling technology, which seeks out and preserves only source code, Brozzler listens to the conversation that a webpage has with the browser as information is requested and sent back and forth. This helps Brozzler better detect dynamic elements that may require a specific request or click to appear. However, even this kind of emulated online interaction is bounded by physical constraints, as in early 2020, when Internet Archive engineers had to add more servers and storage capacity to meet the peak demand of Brozzler’s use by Archive-It partners, lest requests and responses break like an interrupted dial-up connection.
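Brozzler’s way of listening to that conversation can be loosely illustrated with an off-the-shelf browser-automation library. The sketch below uses Playwright rather than Brozzler’s own code: it drives a real browser and logs every response the browser receives while rendering a single page, each of which a browser-based archiving tool could record. A source-code-only crawler would see just the first of these; the browser surfaces everything the page asks for afterward.

```python
# A loose illustration of browser-based capture (Playwright here,
# not Brozzler itself): drive a real browser and observe every
# resource it requests while rendering one page.
from playwright.sync_api import sync_playwright

def log_response(response):
    # Each response is one turn in the page's conversation with the
    # network: HTML, scripts, stylesheets, images, API calls, etc.
    print(response.status, response.url)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", log_response)
    page.goto("https://example.com/")  # placeholder address
    browser.close()
```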

The web is a growing, changing landscape. New technologies are still being developed to preserve what we deem important to sustain, be it informational, cultural, or even infrastructural. The foundational architecture choices made 30 years ago affect how web archiving works today, and the choices we make today will continue to affect the web long into the future.