
On Becoming a Contributor to the HTTP Archive

The HTTP Archive is an open source project that tracks how the web is built. Twice a month it crawls 1.3 million web pages on desktop and emulated mobile devices, and collects technical information about each of the web pages. That information is then aggregated and made available in curated reports. The raw data is also made available via Google BigQuery, which makes answering interesting questions about the web accessible to anyone with some knowledge of SQL as well as the curiosity to dig in.

When Steve Souders created the project back in 2010, it included far fewer pages, but it was immensely valuable to the community. As sponsorship increased, so did the infrastructure and the ability to do more with it. Over time, more and more information was added to the archive, including HAR files, Lighthouse reports and even response bodies.

In 2017, Ilya Grigorik, Patrick Meenan and Rick Viscomi started maintaining the project. They have done some amazing work overhauling the website, creating new and useful reports and continuing to push the envelope on what the HTTP Archive is capable of providing to the web community. As of last week, I’ve joined Ilya, Pat and Rick as a co-maintainer of the HTTP Archive, and I couldn’t be more excited!

So how have I been using the HTTP Archive?

Rarely does a week go by without someone asking a question, or sharing a news article that provokes one, that can be answered with the archive. I love diving deep into questions about the web, and many of my colleagues joke about “nerd sniping Paul”. Fortunately, no Pauls have been injured using BigQuery :).

Source: https://xkcd.com/356/

Over the past few months, I’ve been sharing some of my research on the HTTP Archive Discussion forums. One recent post came just a few days ago, after the Blink-Dev team announced that the Application Cache was being deprecated in Chrome. It only took a few minutes to write a SQL query to identify sites that are still using this feature. After doing some analysis on the data, I wound up sharing my research with the Blink team so that they could track this. Since I work at Akamai, I’m also planning to give a proactive heads up to the customers whose sites will be affected. Being able to quickly notify numerous websites of an important change that might impact their business is truly a priceless use case.
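To give a flavor of what a check like this looks for, here is a minimal Python sketch. It is not the SQL query mentioned above. It only assumes that a page opts into the Application Cache by declaring a manifest attribute on its html element, and it scans a response body for that attribute. The names ManifestFinder and appcache_manifest are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class ManifestFinder(HTMLParser):
    """Records the manifest attribute of the first <html> tag, if present."""

    def __init__(self) -> None:
        super().__init__()
        self.manifest: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.manifest is None:
            for name, value in attrs:
                if name == "manifest" and value:
                    self.manifest = value


def appcache_manifest(body: str) -> Optional[str]:
    """Return the AppCache manifest URL declared in an HTML body, or None."""
    finder = ManifestFinder()
    finder.feed(body)
    return finder.manifest


if __name__ == "__main__":
    # Hypothetical response bodies keyed by made-up page URLs.
    pages = {
        "https://example.com/": '<html manifest="offline.appcache"><body></body></html>',
        "https://example.org/": '<html lang="en"><body></body></html>',
    }
    for url, body in pages.items():
        print(url, appcache_manifest(body))
```

Run over stored response bodies, a helper along these lines would flag pages that still declare a manifest. The archive's own tables and the actual query may look quite different.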

At the 2018 Fluent Conference in San Jose, CA this past June, I shared a few additional examples of how I’ve used the HTTP Archive at Akamai. You can see the slides here, where I talk about how I used the archive to help improve configuration defaults, assist in product research and even send security notifications.

I’m truly grateful that Akamai is both sponsoring the HTTP Archive and allowing me to spend some of my time supporting it. The project provides a significant benefit for the web community and it’s just so much fun to work with. I’m really looking forward to working with Ilya, Pat and Rick on this, and I can’t wait to see what comes next!