Understanding web archive access and use with Google Analytics: Lessons and questions from the Federal Depository Library Program

October 10th, 2017

The following is a guest post by Andra Stump of GPO contractor Zimmerman Associates, Inc., with contributions by Dory Bower of the U.S. Government Publishing Office (GPO). Stump and Bower introduce how Archive-It partners at the Federal Depository Library Program (FDLP) Web Archive have implemented and use Google Analytics to understand the use of their existing collections and plan for the future. If you are an Archive-It partner, you can follow these brief directions to implement Google Analytics on your own collections and keep up to date on the latest advice and best practices with your peers here in our community forum.

The U.S. Government Publishing Office (GPO) has been an Archive-It partner since 2011. All the content captured must be within scope of the Federal Depository Library Program (FDLP). As agencies increasingly disseminate information directly to their websites, and not through GPO, web archiving proved a good option for continuing to provide permanent public access to Federal Agency web content. Each collection represents a single Government agency or website and includes its social media. The FDLP Web Archive may be accessed via the Catalog of U.S. Government Publications (CGP) and our Archive-It Collection Page. Understanding how well our archived content meets the needs of users has become increasingly important to us as we work toward our goals for access. This post describes some of the methodologies we use to evaluate data with Google Analytics, as well as some challenges.

In August 2015, GPO began exploring methods to analyze our Archive-It data collected through Google Analytics. We had been requesting user metrics from Archive-It on a quarterly basis, but with the addition of the Google Analytics feature, we wanted to see what more we could learn about our users and how we could best utilize this tool. However, we first had to learn how to use Google Analytics.

The first step was to go through some of the many professional sources available, such as Google’s Analytics Academy Courses, and the articles The Suitability of Web Analytics Key by Jody Condit Fagan and Evaluating Information Seeking and Use in the Changing Virtual world: the Emerging Role of Google Analytics by D.J. Clark, David Nicholas, and Hamid Jamali. From these sources, we settled on collecting standard data first: new and returning users, the top ten referrals, landing pages, and the paths visitors used to reach the archived sites (direct, referral, or organic search). To explain a little further, a referral means people click a link leading to a website; direct means people land on a website through a bookmark, by typing in the address, or by clicking on a link within an email; organic search means people use a search engine like Google or Bing to reach a website; and finally, a landing page is the first page a person views inside a website. To track these measures, each month we create a spreadsheet in which each measure heads its own section.
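As a rough illustration of how the three paths might be tallied for such a monthly spreadsheet, here is a minimal sketch. It assumes rows exported as (medium, sessions) pairs, where the medium values "(none)", "referral", and "organic" are the ones Google Analytics uses for direct, referral, and organic search traffic. The function name and the sample numbers are hypothetical and are not GPO's data.

```python
from collections import Counter

# Medium values that Google Analytics reports for the three paths described above.
CHANNEL_BY_MEDIUM = {
    "(none)": "direct",
    "referral": "referral",
    "organic": "organic search",
}


def tally_channels(rows):
    """Sum sessions per channel from (medium, sessions) pairs."""
    totals = Counter()
    for medium, sessions in rows:
        # Anything outside the three paths is grouped under "other".
        totals[CHANNEL_BY_MEDIUM.get(medium, "other")] += sessions
    return totals


# Hypothetical numbers, for illustration only.
sample_rows = [("(none)", 120), ("referral", 340), ("organic", 95), ("referral", 60)]
channel_totals = tally_channels(sample_rows)
```

The resulting totals could then be copied into the matching section of the monthly spreadsheet.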

Once we started reviewing the data, visitor numbers seemed out of the ordinary. Based on Google Analytics, our average monthly visitors between June and October 2015 were approximately 2,600. This number seemed high compared to metrics that were previously provided by Archive-It. There was some digging to do!

Our referral data revealed the source of the high numbers: sites like free-social-buttons.com, floating-share-buttons.com, and get-free-social-traffic.com (Image 1). Each of these bizarre URLs is Ghost Spam, whose goal is to induce you to click on the link in your analytics reports and lead you to spam sites. To clean our data of Ghost Spam we followed the procedures found on the blog Ohow.co: Digital Marketing & Analytics in their article “Ultimate Guide to Removing Google Analytics Spam and Other Junk Traffic.” We created a regular expression filter, wayback.archive\-it\.org|archive\-it\.org, that includes only hits carrying our valid hostnames and excludes the invalid hostnames that Ghost Spam carries. The filter worked, and we continue to be Ghost Spam free.

Example of ghost spam in Google Analytics

Image 1 – Ghost Spam
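The effect of the hostname filter can be checked outside Google Analytics with a few lines of Python. This is only a sketch: the pattern is the one quoted above, but the sample hits, their field names, and the helper function are hypothetical, and the partial-match behavior is modeled here with re.search.

```python
import re

# Include-filter pattern quoted in the post.
VALID_HOSTNAME = re.compile(r"wayback.archive\-it\.org|archive\-it\.org")


def keep_hit(hostname):
    """Return True when the hit's hostname matches the include filter."""
    return VALID_HOSTNAME.search(hostname) is not None


# Hypothetical hits: the last two imitate Ghost Spam, which never loads
# the real pages and so carries no valid hostname.
hits = [
    {"hostname": "wayback.archive-it.org", "referrer": "catalog.gpo.gov"},
    {"hostname": "archive-it.org", "referrer": "(direct)"},
    {"hostname": "(not set)", "referrer": "free-social-buttons.com"},
    {"hostname": "example.invalid", "referrer": "get-free-social-traffic.com"},
]
kept = [hit for hit in hits if keep_hit(hit["hostname"])]
```

Because the pattern is not anchored, any hostname that merely contains archive-it.org would also pass. Anchoring the pattern with ^ and $ would be stricter, though the post does not say whether that was needed.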

Once the data was spam free, we could begin answering emerging questions such as, “how much internal traffic is reflected in the data?” No matter how small internal traffic may be, it skews the data. Google Analytics does not make IP address data available, so separating internal traffic from external traffic is challenging. Another layer of difficulty arises because visitors sometimes view the same collections that we are viewing internally.

To begin analyzing the data, we set the calendar dates to cover the month on which we are concentrating. Next, we access data from the Landing Page report and add a secondary dimension of City (Image 2). Once the dimension is added, we focus on sessions from the city of Washington and on whether they show a high ‘Pages Per Session’ and ‘Average Session Duration.’ Depending on the size of the website, our team can spend multiple hours on a single collection, and sessions of this length make the data unreliable. We also keep a list of the collections our team worked on during the month, which lets us quickly confirm whether a member of our team touched a given website.

Traffic originating from Washington is not automatically internal. In Image 2 below, we have three colored circles. The yellow circles mark URLs that our team worked on at the time and that have a high ‘Average Session Duration,’ while the red circle marks a high ‘Average Session Duration’ on a site our team was not working on at the time. It is important to compare the list of collections our team has been working on against our list of Washington hits.

Tracking internal and external visits in Google Analytics

Image 2 – Internal vs. External Hits
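The comparison described above can be sketched as a small triage routine. Everything here is an assumption made for illustration: the thresholds for what counts as "high," the label names, the row fields, and the sample URLs are hypothetical, and the post does not say that GPO automates this step.

```python
from dataclasses import dataclass


@dataclass
class Row:
    landing_page: str
    city: str
    pages_per_session: float
    avg_session_seconds: float


def classify(row, worked_on, min_pages=10.0, min_seconds=600.0):
    """Label one Landing Page + City row for review.

    min_pages and min_seconds are hypothetical cut-offs for "high".
    """
    if row.city != "Washington":
        return "likely external"
    heavy = (row.pages_per_session >= min_pages
             and row.avg_session_seconds >= min_seconds)
    if not heavy:
        return "likely external"
    # Heavy Washington sessions are internal only if the team worked on that site.
    return "likely internal" if row.landing_page in worked_on else "check manually"


# Hypothetical month of work and report rows.
worked_on = {"/agency-a/", "/agency-b/"}
rows = [
    Row("/agency-a/", "Washington", 42.0, 3600.0),  # like the yellow circles
    Row("/agency-c/", "Washington", 18.0, 1500.0),  # like the red circle
    Row("/agency-b/", "Chicago", 3.0, 120.0),
]
labels = [classify(row, worked_on) for row in rows]
```

Rows labeled "check manually" correspond to long Washington sessions that the team's work list does not explain, so they may be genuine external use.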

To begin creating a filter to block internal traffic, we referred to the Google Analytics Help documentation on Exclude Internal Traffic and found our IP address range. Filters require a regular expression, so we used the free IP Range Regular Expression Builder tool on AnalyticsMarket to build an IP range filter (Image 3), and we confirmed the validity of the builder. Once our regular expression was created, it looked something like the following example: ^100\.100\.05\.([1-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$.

Internal traffic filters in Google Analytics

Image 3 – Internal traffic filter
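To see which addresses an expression like the example covers, it can be tried against a handful of test addresses. The sketch below uses the example pattern exactly as quoted. The helper name and the test addresses are hypothetical, and the real GPO range is not shown in the post.

```python
import re

# Example pattern quoted in the post (not GPO's actual range).
IP_RANGE = re.compile(
    r"^100\.100\.05\.([1-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$"
)


def is_internal(ip):
    """Return True when the address falls inside the example range."""
    return IP_RANGE.match(ip) is not None


# Hypothetical addresses chosen to probe the edges of the range.
test_addresses = [
    "100.100.05.1",
    "100.100.05.100",
    "100.100.05.255",
    "100.100.05.0",
    "100.100.05.256",
    "100.100.5.10",
]
results = {ip: is_internal(ip) for ip in test_addresses}
```

The last octet matches 1 through 255, so .0 and .256 are rejected. Note that the third octet is matched as the literal text 05, so an address written as 100.100.5.10 would not match this particular example.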

Once the IP range filter was applied, we waited one week to allow data to build before checking the success of the filter. Unfortunately, the filter did not work. Blocking internal traffic is trickier than we first expected because our addresses are assigned dynamically through the Dynamic Host Configuration Protocol (DHCP). Currently we are working towards configuring our dynamic IP range in Google Analytics.

As our knowledge of the data increases, so do our questions. For example, because we track referrals, we noticed a large increase in visitors from the U.S. Department of Health and Human Services (HHS) website. One month, we had 53 visitors from HHS, and the next month the number increased to 1,090. We then discovered that HHS had created links to FDLP Web Archive content through their archive.hhs.gov site. From then on, visitor numbers from HHS grew, and HHS remains a steady source of visitors. About half of their visits are bounces (meaning they click the link to open our archived site and immediately close the window), but the visitors who do stay browse for an average of 10 minutes. These statistics indicate that users are finding information of interest, and that we are archiving information that is of value.

Since HHS is now a major source of visits per month, we updated our data spreadsheet. There is now a separate section for HHS to enable us to track the most viewed sites that come in through the HHS archive. This data will supply a picture about which major topics their users are interested in and help point us toward other valuable HHS sites to be archived.

Besides HHS, our next largest source of user traffic is universities accessing the FDLP Web Archive collections through the PURLs in our CGP records. Most of the referring universities are part of the FDLP, but several are outside of the FDLP, and a few are Canadian universities. Data from these universities shows about a 50/50 split between new and returning visitors and only a 45% bounce rate. Those who stay browse an average of three pages and spend up to two minutes in our archive. From this data we can surmise that visitors are willing to click on the PURL and do some searching. The type of information we are gathering from this could be very useful to us for future collection development.

Soon, we will begin answering new questions. For example, in the CGP we have been working towards enhancing accessibility of the FDLP Web Archive by supplying each collection’s catalog record with a second access point. One PURL leads to the calendar page of the homepage, and a second PURL leads to the collection page in Archive-It. How will our users utilize these options? Will one PURL prove more useful than the other, or will both be equally useful? A second change we made was adding broad subject facets and creator facets to our Archive-It metadata, allowing users to narrow down the 138 collections currently part of the FDLP Web Archive. Will there be subjects in which users are more interested? Will the broad subjects prove useful? We expect to answer all of these questions once we have enough data to complete the picture.

We have been using Google Analytics for 1.5 years, and in that time we have found the data to be encouraging. People are locating our collections, and the numbers have been steadily increasing. Government organizations are also becoming more aware of our work, and we hope more will begin linking to our web archive. Google Analytics has proved a useful means for us to analyze whether we are fulfilling GPO’s mission of Keeping America Informed, and it encourages us to take extra steps to increase access.