Analysing Robots.txt at scale with HTTP Archive and BigQuery

In this episode of Search Off the Record, Martin and Gary turn a simple robots.txt question into a data‑driven deep dive using HTTP Archive, WebPageTest, custom JavaScript metrics, and BigQuery. They explore how millions of real robots.txt files are actually written in 2025–2026, which directives and user‑agents are most common, and what that means for modern crawling and AI bots.

Perfect for beginner to mid‑level developers and SEOs: you’ll learn how large‑scale web measurement works (HTTP Archive, Chrome UX Report, Web Almanac) and how to turn raw crawl data into actionable SEO insights. Subscribe for more candid conversations about crawling, indexing, and the data behind how Google Search and the web really work.

Resources:

Web Almanac → https://almanac.httparchive.org/en/2025/
robots.txt custom metric for the HTTP Archive → https://github.com/HTTPArchive/custom-metrics/pull/191
robots.txt parser change → https://github.com/google/robotstxt/commit/4af32e54b715442bb04cd0470e99192f0ffb9792#commitcomment-178586774

Episode transcript → https://goo.gle/sotr108-transcript

Listen to more Search Off the Record → https://goo.gle/sotr-yt
Subscribe to Google Search Channel → https://goo.gle/SearchCentral

Search Off the Record is a podcast series that takes you behind the scenes of Google Search with the Search Relations team.

#SOTRpodcast #SEO #GoogleSearch

Speakers: Martin Splitt, Gary Illyes


Chapters:

    00:00:11 Introduction to Search Off the Record
    00:03:12 The Robots.txt saga: Why analyze it?
    00:04:17 The goal: Identifying top unsupported directives
    00:04:54 Discovery of the HTTP Archive
    00:05:46 How the HTTP Archive works
    00:07:46 Where the crawl data comes from (Chrome UX Report)
    00:11:14 Why use a browser for crawling?
    00:11:58 Querying data with BigQuery (and the cost!)
    00:13:52 Using Custom Metrics for Robots.txt
    00:16:54 The custom JavaScript parser
    00:19:52 Using Regex to extract key-value pairs
    00:21:37 Analyzing the distribution and sharp drop-off
    00:22:43 Identifying broken files and common typos
    00:24:30 Future impact on the Web Almanac SEO chapter
    00:26:20 Closing thoughts and goodbye
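The episode describes extracting directive/value pairs from robots.txt bodies with a regex inside a custom JavaScript metric, and flagging lines that don't parse (broken files and common typos). As a rough illustration only (a minimal sketch; the function name, regex, and output shape are assumptions, not the actual HTTPArchive/custom-metrics code), a line-oriented key-value extractor might look like:

```javascript
// Hypothetical sketch of a robots.txt line parser, loosely in the spirit of
// the custom metric discussed in the episode. Names and regex are assumptions.
function parseRobotsTxt(text) {
  const records = [];
  for (const rawLine of text.split(/\r\n|\r|\n/)) {
    // Strip comments (everything after '#') and surrounding whitespace.
    const line = rawLine.split('#')[0].trim();
    if (!line) continue; // skip blank and comment-only lines
    // Match "key: value" pairs; directive names are case-insensitive.
    const match = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (match) {
      records.push({ directive: match[1].toLowerCase(), value: match[2].trim() });
    } else {
      // Lines with no colon (a common typo) are kept but flagged as unparsed.
      records.push({ directive: null, value: line });
    }
  }
  return records;
}
```

Lines that fall through the regex (e.g. `Crawl-delay 10` with a missing colon) surface exactly the kind of broken-file patterns the episode talks about aggregating across millions of sites.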
