On Becoming a Contributor to the HTTP Archive

The HTTP Archive is an open source project that tracks how the web is built. Twice a month it crawls 1.3 million web pages, on both desktop and emulated mobile devices, and collects technical information about each page. That information is then aggregated and made available in curated reports. The raw data is also made available via Google BigQuery, which makes answering interesting questions about the web accessible to anyone with some knowledge of SQL and the curiosity to dig in.
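
To give a flavor of what that looks like, here’s a minimal example query (the crawl date in the table name is illustrative; swap in a current crawl):

```sql
-- A minimal example of an HTTP Archive query: how many pages were crawled,
-- and what does the median page weigh? The table date is illustrative.
SELECT
  COUNT(*) AS pages,
  ROUND(APPROX_QUANTILES(bytesTotal, 100)[OFFSET(50)] / 1024) AS median_kbytes
FROM `httparchive.summary_pages.2018_07_01_desktop`
```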

When Steve Souders created the project back in 2010, it included far fewer pages – but it was immensely valuable to the community. As sponsorship increased, so did the infrastructure and the ability to do more with it. Over time more and more information was added to the archive – including HAR files, Lighthouse reports and even response bodies.

In 2017 Ilya Grigorik, Patrick Meenan and Rick Viscomi took over maintenance of the project. They have done some amazing work overhauling the website, creating new and useful reports, and continuing to push the envelope on what the HTTP Archive is capable of providing to the web community. As of last week I’ve joined Ilya, Pat and Rick as a co-maintainer of the HTTP Archive, and I couldn’t be more excited!

So how have I been using the HTTP Archive?

Rarely does a week go by without someone asking a question or sharing a news article that provokes a question that can be answered with the archive. I love diving deep into questions about the web, and many of my colleagues joke about “nerd sniping Paul”. Fortunately, no Pauls have been injured using BigQuery :).

Source: https://xkcd.com/356/

Over the past few months, I’ve been sharing some of my research on the HTTP Archive Discussion forums. One recent example: just a few days ago, the Blink-Dev team announced that the Application Cache is being deprecated in Chrome. It only took a few minutes to write a SQL query to identify sites that are still using this feature. After doing some analysis on the data, I wound up sharing my research with the Blink team so that they could track this. Since I work at Akamai, I’m also planning to give a proactive heads up to the customers whose sites will be affected. Being able to quickly notify numerous websites of an important change that might impact their business is truly a priceless use case.
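
The exact query isn’t reproduced here, but a minimal sketch of the approach, scanning response bodies for an AppCache manifest attribute, could look something like this (the table date is illustrative, and the page = url filter is only a rough way to restrict the scan to main HTML documents):

```sql
-- Hypothetical sketch: find pages whose HTML declares an AppCache manifest.
-- Note: the response_bodies tables are large, so this scans a lot of data.
SELECT DISTINCT page
FROM `httparchive.response_bodies.2018_07_01_desktop`
WHERE page = url  -- rough filter for the main HTML document
  AND REGEXP_CONTAINS(body, r'<html[^>]*\smanifest\s*=')
```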

At the 2018 Fluent Conference in San Jose, CA this past June, I shared a few additional examples of how I’ve used the HTTP Archive at Akamai. You can see the slides here, where I talk about how I used the archive to help improve configuration defaults, assist in product research and even send security notifications.

I’m truly grateful that Akamai is both sponsoring the HTTP Archive and allowing me to spend some of my time supporting it. The project provides a significant benefit to the web community, and it’s just so much fun to work with. I’m really looking forward to working with Ilya, Pat and Rick on this – and I can’t wait to see what comes next!

Brotli Compression – How Much Will It Reduce Your Content?

A few years ago Brotli compression entered the webperf spotlight with impressive gains of up to 25% over gzip compression. The algorithm was created by Google, which initially introduced it as a way to compress web fonts via the WOFF2 format. Later, in 2015, it was released as a compression library to optimize the delivery of web content. Despite Brotli being a completely different format from gzip, it was quickly supported by most modern web browsers.
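
As a quick teaser before the full analysis: the archive also makes it easy to gauge Brotli adoption. A minimal sketch, assuming the summary_requests tables expose the response’s Content-Encoding as a resp_content_encoding column (the table date is illustrative):

```sql
-- Hypothetical sketch: how many responses are Brotli- vs gzip-encoded?
SELECT
  resp_content_encoding AS encoding,
  COUNT(*) AS responses
FROM `httparchive.summary_requests.2018_07_01_desktop`
WHERE resp_content_encoding IN ('br', 'gzip')
GROUP BY encoding
ORDER BY responses DESC
```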

Continue reading

HTTP Heuristic Caching (Missing Cache-Control and Expires Headers) Explained

Have you ever wondered why WebPageTest can sometimes show that a repeat view downloaded fewer bytes, while also triggering warnings related to browser caching? It can seem like the test is reporting an issue that does not exist, but in fact it’s often a sign of a more serious issue that should be investigated. Often the issue is not a lack of caching, but rather a lack of control over how your content is cached.
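
When a response carries neither a Cache-Control nor an Expires header, browsers fall back to a heuristic freshness lifetime, commonly 10% of the time elapsed since Last-Modified, which is exactly that loss of control. Here is a rough sketch of how you might size the problem in the HTTP Archive, assuming the summary_requests columns resp_cache_control, resp_expires and resp_last_modified (table date illustrative):

```sql
-- Hypothetical sketch: how many responses rely on heuristic caching?
SELECT
  COUNT(*) AS total_requests,
  COUNTIF(IFNULL(resp_cache_control, '') = ''
          AND IFNULL(resp_expires, '') = '') AS no_explicit_freshness,
  COUNTIF(IFNULL(resp_cache_control, '') = ''
          AND IFNULL(resp_expires, '') = ''
          AND IFNULL(resp_last_modified, '') != '') AS heuristically_cached
FROM `httparchive.summary_requests.2018_07_01_desktop`
```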

If you have not run into this issue before, then examine the screenshot below to see an example:

Continue reading

Adoption of HTTP Security Headers on the Web

Over the past few weeks the topic of security related HTTP headers has come up in numerous discussions – both with customers I work with and with colleagues who are trying to help improve the security posture of their customers. I’ve often felt that these headers were underutilized, and a quick test on Scott Helme’s excellent securityheaders.io site usually proves this to be true. So I decided to take a deeper look at how these headers are being used on a large scale.

Looking at this data through the lens of the HTTP Archive, I thought it would be interesting to see if we could give the web a scorecard for security headers. I’ll dive deeper into how each of these headers is implemented below, but let’s start off by looking at the percentage of sites that are using them. As I suspected, adoption is quite low. Furthermore, it seems that adoption is only marginally higher for the most popular sites.
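
A minimal sketch of how such a scorecard could be computed, assuming that summary_requests stores less-common response headers in a respOtherHeaders string and flags main HTML documents with firstHtml (both are assumptions about the schema, and the table date is illustrative):

```sql
-- Hypothetical sketch: adoption of a few security headers on main documents.
SELECT
  COUNT(*) AS html_pages,
  COUNTIF(LOWER(respOtherHeaders) LIKE '%strict-transport-security%') AS hsts,
  COUNTIF(LOWER(respOtherHeaders) LIKE '%content-security-policy%') AS csp,
  COUNTIF(LOWER(respOtherHeaders) LIKE '%x-frame-options%') AS x_frame_options
FROM `httparchive.summary_requests.2018_07_01_desktop`
WHERE firstHtml  -- assumed boolean flag marking the main HTML document
```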

Continue reading

Cache Control Immutable – A Year Later

In January 2017, Facebook wrote about a new Cache-Control directive – immutable – which was designed to tell supporting browsers not to attempt to revalidate an object on a normal reload during its freshness lifetime. Firefox 49 implemented it, while Chrome went with a different approach by changing the behavior of the reload button. It seems that WebKit has also implemented the immutable directive since then.

So it’s been a year – let’s see where Cache-Control immutable is being used in the wild!
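
A minimal sketch of that check, assuming summary_requests exposes the Cache-Control header as a resp_cache_control column (table date illustrative):

```sql
-- Hypothetical sketch: what share of responses use the immutable directive?
SELECT
  COUNT(*) AS total_responses,
  COUNTIF(LOWER(resp_cache_control) LIKE '%immutable%') AS immutable_responses
FROM `httparchive.summary_requests.2018_07_01_desktop`
```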

Continue reading

Measuring the Performance of Firefox Quantum with RUM

On Nov 14th, Mozilla released Firefox Quantum. On launch day, I personally felt that the new version was rendering pages faster, and I heard anecdotal reports indicating the same. There have also been a few benchmarks that seem to show this latest Firefox version getting content to screens faster than its predecessor. But I wanted to try a different approach to measurement.

Given the vast amount of performance information that we collect at Akamai, I thought it would be interesting to benchmark the performance of Firefox Quantum with a large set of real end-user performance data. The results were dramatic: the new browser improved DOM Content Loaded time by an extremely impressive 24%. Let’s take a look at how those results were achieved.
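
That RUM data isn’t public, so the query below is purely a hypothetical sketch of the shape of the comparison, using a made-up rum_beacons table with user_agent_family, user_agent_major and dom_content_loaded_ms columns, rather than the actual analysis:

```sql
-- Hypothetical sketch against a made-up RUM beacon table: median DOM
-- Content Loaded for Firefox 56 vs 57 (Quantum).
SELECT
  user_agent_major AS firefox_version,
  APPROX_QUANTILES(dom_content_loaded_ms, 100)[OFFSET(50)] AS median_dcl_ms,
  COUNT(*) AS beacons
FROM `my_project.rum.rum_beacons`  -- hypothetical table
WHERE user_agent_family = 'Firefox'
  AND user_agent_major IN (56, 57)
GROUP BY firefox_version
```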

Continue reading

Which 3rd Party Content Loads Before Render Start?

Since the HTTP Archive captures timing information for each request, I thought it would be interesting to correlate request timings (i.e., when an object was loaded) with page timings. The idea is that we can categorize resources as loaded before or after an event, as sketched below.
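
Here is a rough sketch of that categorization, joining each request to its page and comparing the request’s start to renderStart (the startedDateTime fields in the summary tables are second-granularity epochs, so this is a coarse approximation, and the table date is illustrative):

```sql
-- Hypothetical sketch: which content types load before render start?
SELECT
  r.type AS content_type,
  COUNTIF((r.startedDateTime - p.startedDateTime) * 1000 < p.renderStart) AS before_render_start,
  COUNTIF((r.startedDateTime - p.startedDateTime) * 1000 >= p.renderStart) AS after_render_start
FROM `httparchive.summary_requests.2018_07_01_desktop` r
JOIN `httparchive.summary_pages.2018_07_01_desktop` p
USING (pageid)
WHERE p.renderStart > 0  -- ignore pages with no measured render start
GROUP BY content_type
ORDER BY before_render_start DESC
```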

Content Type Loaded Before/After Render Start

It’s generally well known that third party content impacts performance. We see this with both resource loading and JavaScript execution blocking the browser from loading other content. While we don’t have the data to evaluate per-resource script execution timings here, we can definitely look at when resources were loaded with respect to certain page timings and get an idea of what is being loaded before a page starts rendering.

Continue reading