Craig Campbell

10% of the Top Million Sites are Dead

·7 mins

As part of an ongoing series of research I’m doing as I dabble with building a new kind of search engine (more on this later), I decided to take a dive into understanding the top websites that represent the internet as of 2022. For my purposes, the Majestic Million dataset felt like the perfect fit, as it is ranked by the number of links pointing to each domain (while also taking into account the diversity of the referring domains). Additionally, it contains subdomains as well as root domains, which is a better fit for my particular research angle.

Spoiler: the Majestic Million has some data issues; always verify before using it blindly

First: let’s get the data and understand the format #

This part is simple: Majestic kindly provides a free download of the top million domains as a CSV. After downloading it, let’s take a quick peek at the file and verify some base assumptions by loading it up in DuckDB.

I love DuckDB and will write a proper love letter to it in a future post. It was easily 20x faster than SQLite in my testing
$ duckdb
v0.4.0 da9ee490d
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D describe select * from 'majestic_million.csv'; 
┌────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│  column_name   │ column_type │ null │ key │ default │ extra │
├────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ GlobalRank     │ INTEGER     │ YES  │     │         │       │
│ TldRank        │ INTEGER     │ YES  │     │         │       │
│ Domain         │ VARCHAR     │ YES  │     │         │       │
│ TLD            │ VARCHAR     │ YES  │     │         │       │
│ RefSubNets     │ INTEGER     │ YES  │     │         │       │
│ RefIPs         │ INTEGER     │ YES  │     │         │       │
│ IDN_Domain     │ VARCHAR     │ YES  │     │         │       │
│ IDN_TLD        │ VARCHAR     │ YES  │     │         │       │
│ PrevGlobalRank │ INTEGER     │ YES  │     │         │       │
│ PrevTldRank    │ INTEGER     │ YES  │     │         │       │
│ PrevRefSubNets │ INTEGER     │ YES  │     │         │       │
│ PrevRefIPs     │ INTEGER     │ YES  │     │         │       │
└────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
D select count(distinct domain) from 'majestic_million.csv';
┌──────────────────────────┐
│ count(DISTINCT "domain") │
├──────────────────────────┤
│ 1000000                  │
└──────────────────────────┘
D select globalrank, domain from 'majestic_million.csv' limit 5;
┌────────────┬───────────────┐
│ GlobalRank │    Domain     │
├────────────┼───────────────┤
│ 1          │ google.com    │
│ 2          │ facebook.com  │
│ 3          │ youtube.com   │
│ 4          │ twitter.com   │
│ 5          │ instagram.com │
└────────────┴───────────────┘

Yes, it really is that easy thanks to DuckDB doing all the heavy lifting. And the data quality looks good from this two-second peek: we do indeed have 1 million unique domains, and the top ones look reasonable.

Perfect. Everything looks great. End of post. Right?

Second: verifying domain normalization, or where we find our first problem #

Domain normalization is a bitch. I’ve worked on it in the past as part of my time building Facebook Search, and it’s a tricky thing to get right. One super simple but commonly overlooked case is parsing out the sometimes-optional www prefix.

As a concrete example, for the purpose of identifying the top unique domains from a large corpus of scraped content, you’d ideally want to normalize both www.google.com and google.com to the same row. Simple enough, so obviously Majestic does this, right?
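To make that concrete, here’s a minimal DuckDB sketch of the kind of normalization I’d expect upstream of a list like this (just the www case; a real pipeline would handle much more): lowercase the domain, strip an optional leading www., and count unique rows. If the list were properly normalized, the two counts would match.

-- Sketch only: count unique domains before and after stripping an
-- optional leading "www." (and lowercasing).
select
  count(distinct domain) as raw_domains,
  count(distinct regexp_replace(lower(domain), '^www\.', '')) as normalized_domains
from 'majestic_million.csv';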

D select count(*) from 'majestic_million.csv' where domain glob 'www.*';
┌──────────────┐
│ count_star() │
├──────────────┤
│ 410          │
└──────────────┘
D select globalrank, domain from 'majestic_million.csv' where domain glob 'www.*' limit 5;
┌────────────┬───────────────────────┐
│ GlobalRank │        Domain         │
├────────────┼───────────────────────┤
│ 61         │ www.ncbi.nlm.nih.gov  │
│ 130        │ www.gov.uk            │
│ 275        │ www.beian.miit.gov.cn │
│ 516        │ www.nhs.uk            │
│ 642        │ www.gov.cn            │
└────────────┴───────────────────────┘

Uh oh. 410 errant domains with the www prefix not stripped out, and a quick peek confirms that these are not in the long tail of the list either.

Let’s check if any of these non-normalized domains have duplicates in the list.

D select count(*) from (select a.globalrank, a.domain, b.globalrank, b.domain from 'majestic_million.csv' as a inner join (select globalrank, domain, regexp_replace(domain, '^www\.', '') as norm_domain from 'majestic_million.csv' where domain glob 'www.*' order by globalrank) as b on a.domain = b.norm_domain);
┌──────────────┐
│ count_star() │
├──────────────┤
│ 221          │
└──────────────┘
D select a.globalrank, a.domain, b.globalrank, b.domain from 'majestic_million.csv' as a inner join (select globalrank, domain, regexp_replace(domain, '^www\.', '') as norm_domain from 'majestic_million.csv' where domain glob 'www.*' order by globalrank) as b on a.domain = b.norm_domain limit 5; 
┌────────────┬───────────────────┬────────────┬───────────────────────┐
│ GlobalRank │      Domain       │ GlobalRank │        Domain         │
├────────────┼───────────────────┼────────────┼───────────────────────┤
│ 72         │ beian.miit.gov.cn │ 275        │ www.beian.miit.gov.cn │
│ 1926       │ bl.uk             │ 1840       │ www.bl.uk             │
│ 2393       │ parliament.uk     │ 2376       │ www.parliament.uk     │
│ 5845       │ royal.uk          │ 5535       │ www.royal.uk          │
│ 6421       │ gov.wales         │ 553656     │ www.gov.wales         │
└────────────┴───────────────────┴────────────┴───────────────────────┘

Damn, 221 duplicate domains. The Majestic Million is officially the Majestic 999,779.

My faith was shaken. Just how much of this list could I really trust?

Third: Into the abyss, aka rapidly crawling a million domains on my MacBook Pro #

I decided a better quality-filtering step was needed to really understand this list. After some thought, I settled on a basic but very reasonable check: hit each domain and verify that it is online and responsive to HTTP requests. With only a million domains, this could be run from my own computer relatively easily, and it would give us a quick temperature check on whether the list truly is representative of the “top sites on the internet”.

So, let’s crack out our handy-dandy terminal and whip up a quick parallelized domain checker:

  1. Simple line output first
$ cat majestic_million.csv
GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs
1,1,google.com,com,492783,2480253,google.com,com,1,1,493554,2490230
2,2,facebook.com,com,491683,2636952,facebook.com,com,2,2,492662,2648135
...
  2. Using awk: a regex to skip the header and grab only the domain from each row
$ cat majestic_million.csv | \
  awk '/^[0-9]+,.+/ { split($0, a, ","); print a[3] }'
google.com
facebook.com
youtube.com
...
  3. Set up parallelization to fully use all CPU cores (8 for me) for URL processing
$ cat majestic_million.csv | \
  awk '/^[0-9]+,.+/ { split($0, a, ","); print a[3] }' | \
  xargs -n1 -P8 sh -c 'echo "$@"' _
google.com
facebook.com
youtube.com
...
  4. And now perform the actual curl request in each sub-process. We use a simple HEAD request since we really just care about the status code, and set a 60-second timeout. We also simplify the output format to {url},{http_code} to make analysis easy
$ cat majestic_million.csv | \
  awk '/^[0-9]+,.+/ { split($0, a, ","); print a[3] }' | \
  xargs -n1 -P8 sh -c \
  'curl -LI -s -o /dev/null -w "%{url},%{http_code}\n" -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" --max-time 60 "$@"' _
google.com,200
twitter.com,200
facebook.com,200
...
  5. Output results to a single file. I also went ahead and piped stdout to /dev/null, since printing everything to the terminal actually caused it to OOM thanks to my endless-history setting (whoops!). Note: we use tee -a as it provides an atomic append operation that works across many parallel processes
$ cat majestic_million.csv | \
  awk '/^[0-9]+,.+/ { split($0, a, ","); print a[3] }' | \
  xargs -n1 -P8 sh -c \
  'curl -LI -s -o /dev/null -w "%{url},%{http_code}\n" -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" --max-time 60 "$@" | \
   tee -a http_codes.csv > /dev/null' _
  6. Now we crank up the parallelization for the final run; the network, not the CPU, was always going to be the main bottleneck for something like this. I found that my local system could easily handle 512 parallel processes, with my CPU at ~35% utilization, 2GB of RAM usage, and a constant 1.5MB/s down on the network. Obviously YMMV depending on your system, so I suggest doing your own tuning if you’re following along. We simply have to bump the -P argument on xargs to do this. Final command:
$ cat majestic_million.csv | \
  awk '/^[0-9]+,.+/ { split($0, a, ","); print a[3] }' | \
  xargs -n1 -P512 sh -c \
  'curl -LI -s -o /dev/null -w "%{url},%{http_code}\n" -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" --max-time 60 "$@" | \
   tee -a http_codes.csv > /dev/null' _

Easy peasy. Now we wait… (you can use cat http_codes.csv | wc -l to verify run status)

And in one short hour (technically ~50 minutes for me), we have a fresh, shiny output file with the HTTP response codes of the Majestic Million domains.

So now the moment of truth, let’s use our trusty friend DuckDB to check the top 5 most common response codes for the Majestic Million:

D select count(*) as num, column1 as http_code from 'http_codes_majestic_million.csv' group by column1 order by count(*) desc limit 5; 
┌────────┬───────────┐
│  num   │ http_code │
├────────┼───────────┤
│ 815669 │ 200       │
│ 107776 │ 0         │  <-- this is the one we care about
│ 18140  │ 403       │
│ 14183  │ 404       │
│ 10353  │ 301       │
└────────┴───────────┘

107,776 domains could not even be connected to. That’s 10.7% of the full list, which is pretty bad if you ask me. On top of that, there’s a long tail of sites that returned a variety of non-200 response codes, but just to be conservative we’ll assume they are all valid anyway; maybe our curl was simply hitting some kind of Cloudflare check or a blanket block on HEAD requests. But straight connection errors to the domain homepages? Very suspect.
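If you want to reproduce that percentage from the output file yourself, a DuckDB query along these lines should do it (with the same assumption as the query above: the headerless CSV comes in as column0 = url and column1 = http code):

-- Sketch: share of domains that returned no HTTP response at all
-- (curl reports these as code 0).
select
  sum(case when column1 = 0 then 1 else 0 end) as unreachable,
  count(*) as total,
  round(100.0 * sum(case when column1 = 0 then 1 else 0 end) / count(*), 1) as pct_unreachable
from 'http_codes_majestic_million.csv';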

Conclusion: The Majestic Million 892,003 #

While I had expected some cleanliness issues, I wasn’t expecting quality problems on this scale from a dataset I’ve seen referenced pretty extensively across the web (and that’s with me being very conservative about knocking domains out).

One big caveat that I’m well aware of is that I only crawled the domain homepages, and it could very well be that many of these domains are configured to respond only on specific endpoints. Even so, this feels iffy to me and doesn’t really indicate a high-quality site worthy of placement on this kind of list. Another potential source of problems could be misconfigured DNS or a reverse proxy where non-SSL or non-WWW requests aren’t being redirected properly. A potential follow-up for anyone interested in digging further.
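For anyone who picks that up, one possible starting point is to join the crawl results back onto the original list and eyeball the highest-ranked domains that returned nothing at all (sketched against the files from this post, with the same column naming assumption as before):

-- Sketch: highest-ranked Majestic domains with no HTTP response at all.
-- These are the first candidates to re-check by hand (https-only hosts,
-- broken redirects, endpoint-only domains, etc.).
select m.GlobalRank, m.Domain
from 'majestic_million.csv' as m
join 'http_codes_majestic_million.csv' as h on m.Domain = h.column0
where h.column1 = 0
order by m.GlobalRank
limit 20;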

As one more potential follow-up, I’m curious how alternative top-domain lists (such as Tranco and Cisco Umbrella) fare against these quality checks, but I’ll leave that for another day.

Here’s the full CSV of all HTTP response codes pulled using the above method, for those who’d like to dig in more: http_codes_majestic_million.csv.gz

Have comments? Tweet at me