I was recently asked whether script resources change frequently or persist for a long time. I imagined that it would vary greatly between 1st and 3rd party content, so I decided to look at both.

I used the difference between the Date and Last-Modified response headers to calculate the age. You might be wondering why I didn’t use the aptly named Age header for this analysis. That is because it’s only present on 14% of HTTP responses. Comparatively, Last-Modified is present on 72% of HTTP responses.
    SELECT
      ROUND(SUM(IF(resp_date <> "", 1, 0)) / COUNT(*), 2) date_pct,
      ROUND(SUM(IF(resp_last_modified <> "", 1, 0)) / COUNT(*), 2) lastmodified_pct,
      ROUND(SUM(IF(resp_age <> "", 1, 0)) / COUNT(*), 2) age_pct,
      COUNT(*) requests
    FROM `httparchive.summary_requests.2019_04_01_mobile` r
The query below groups resources by 1st or 3rd party, content type, and the relative age of the resource in weeks. A user-defined function converts each timestamp to epoch seconds and throws out any timestamps after the year 2050, which prevents int64 overflows from erroneous timestamps in the data. Finally, we subtract the time value of Last-Modified from Date and calculate the age in weeks.
    CREATE TEMP FUNCTION dateConversion(ts STRING)
    RETURNS STRING
    LANGUAGE js AS """
      // Convert the header value to epoch seconds. Unparseable or
      // out-of-range values fall through and return nothing.
      var epoch = Math.floor(Date.parse(ts) / 1000);
      if (epoch <= 2558874097 && epoch > 0) return epoch;
    """;

    SELECT
      type,
      -- 1 if the request host contains the name of the page's registrable domain, else 3
      IF(STRPOS(NET.HOST(r.url), REGEXP_EXTRACT(NET.REG_DOMAIN(p.url), r'([\w-]+)')) > 0, 1, 3) AS party,
      -- Days between Last-Modified and Date, expressed in weeks
      ROUND(TIMESTAMP_DIFF(
        TIMESTAMP_SECONDS(SAFE_CAST(dateConversion(resp_date) AS INT64)),
        TIMESTAMP_SECONDS(SAFE_CAST(dateConversion(resp_last_modified) AS INT64)),
        DAY) / 7) age_weeks,
      COUNT(*) requests
    FROM `httparchive.summary_requests.2019_04_01_mobile` r
    INNER JOIN `httparchive.summary_pages.2019_04_01_mobile` p
      ON r.pageid = p.pageid
    WHERE resp_last_modified <> "" AND resp_date <> ""
    GROUP BY type, party, age_weeks
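To make the two calculations in the query easier to follow, here is a minimal JavaScript sketch of the same logic for a single request, outside of BigQuery. The function names (`toEpochSeconds`, `ageInWeeks`, `partyOf`) are my own and do not appear in the query. `partyOf` assumes you already have the page's registrable domain, since plain JavaScript has no built-in equivalent of `NET.REG_DOMAIN`. Rounding and truncation are meant to approximate `TIMESTAMP_DIFF` and `ROUND`, not to reproduce them exactly.

```javascript
// Same upper bound as the dateConversion UDF in the query.
const MAX_EPOCH = 2558874097;
const SECONDS_PER_DAY = 86400;

// Convert an HTTP date string to epoch seconds, or null when the value
// is unparseable or outside the accepted range.
function toEpochSeconds(ts) {
  const epoch = Math.floor(Date.parse(ts) / 1000);
  return epoch > 0 && epoch <= MAX_EPOCH ? epoch : null;
}

// Age in weeks: whole days between Last-Modified and Date, divided by 7
// and rounded to the nearest week.
function ageInWeeks(dateHeader, lastModifiedHeader) {
  const served = toEpochSeconds(dateHeader);
  const modified = toEpochSeconds(lastModifiedHeader);
  if (served === null || modified === null) return null;
  const days = Math.trunc((served - modified) / SECONDS_PER_DAY);
  return Math.round(days / 7);
}

// 1 when the request host contains the first label of the page's
// registrable domain (e.g. "example" from "example.com"), otherwise 3.
// pageRegDomain is assumed to be supplied by the caller.
function partyOf(requestUrl, pageRegDomain) {
  const host = new URL(requestUrl).hostname;
  const match = pageRegDomain.match(/[\w-]+/);
  if (match === null) return 3;
  return host.includes(match[0]) ? 1 : 3;
}
```

Note that the party check is a substring match, so a request to a host such as `notexample.net` would count as first party for a page on `example.com`. It is a cheap approximation rather than a strict same-site test.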
Graphically, these results look like this. Note that because there are 55 bars in this stacked chart, there is some overlap in the legend. I used a 10-color palette to display this, so that you can distinguish the week based on its position in the chart. For example, the yellow bars to the left are resources < 1 week old. The orange bar all the way to the right is > 2 years old.
There are some interesting observations in this data:
- With the exception of HTML, third party content has a smaller resource age compared to first party content. One can only hope that HTML is being cached somewhere…
- Audio and video resources tend to be older and more likely to be cacheable, even more so for 1st party videos.
- Some of the longest-lived first party content on the web is the traditionally cacheable objects: images, scripts, CSS, web fonts, audio and video.
- There is a significant gap in the 1st vs 3rd party resource age of CSS and web fonts. 95% of first party fonts are older than 1 week, while 50% of 3rd party fonts are less than 1 week old! This makes a strong case for self-hosting web fonts!
Since a question about scripts was what led me down this path, I thought it would be interesting to expand the bar charts for 1st vs 3rd party scripts. 35% of third party script resources and 8% of first party scripts are less than 1 week old. This explains why 3rd party script resources are less likely to be cached for long periods.
Resource age is an important heuristic in developing a good caching strategy. How frequently do certain resources change? How sensitive is your application to these changes? Would you be able to cache longer if you served a third party resource as a first party? And how much non-cacheable third party content should your users have to load? Answering these questions about your own site’s data will help you tweak your caching policies and achieve higher offload and faster page load times.
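As one illustration of using resource age as a caching input, the sketch below applies the heuristic that HTTP caches may fall back on when a response carries no explicit freshness information: a fraction of the time since Last-Modified, with 10% being the typical value mentioned in the HTTP caching specification. The function name and the `fraction` parameter are my own, and this is only a starting point for thinking about a lifetime, not a recommendation for any particular site.

```javascript
// Heuristic freshness lifetime in seconds: a fraction (default 10%) of
// the time between Last-Modified and Date. Returns 0 when either header
// is unparseable or Last-Modified is later than Date.
function heuristicFreshnessSeconds(dateHeader, lastModifiedHeader, fraction = 0.1) {
  const served = Date.parse(dateHeader);
  const modified = Date.parse(lastModifiedHeader);
  if (Number.isNaN(served) || Number.isNaN(modified) || modified > served) {
    return 0;
  }
  const ageSeconds = (served - modified) / 1000;
  return Math.floor(ageSeconds * fraction);
}
```

For a resource last modified 10 days before it was served, this gives a lifetime of about one day. A resource that changes weekly gets a far shorter lifetime than one that has been stable for two years.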
Originally published at https://discuss.httparchive.org/t/analyzing-resource-age-by-content-type/1659