Amazon has taken down its famous Top 1 Million Websites CSV file, apparently without prior notice. Thousands of analytics systems and WWW tracking services that relied on the file now depend on the paid Amazon Top Sites service. At a quarter of a dollar per 100 URLs, the full list of a million websites now costs US$2,500 if fetched from the paid AWS service.
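The US$2,500 figure follows directly from the quoted price. As a quick check, here is a minimal sketch of the arithmetic; the constant and function names are illustrative, and the only inputs are the price and list size given above.

```python
# Inputs taken from the text: US$0.25 per 100 URLs, 1,000,000 URLs in total.
# All names here are hypothetical, chosen only for this illustration.
PRICE_PER_BATCH_USD = 0.25
URLS_PER_BATCH = 100
TOTAL_URLS = 1_000_000


def total_cost_usd(total_urls, urls_per_batch=URLS_PER_BATCH,
                   price_per_batch=PRICE_PER_BATCH_USD):
    """Cost of fetching total_urls, assuming partial batches are billed in full."""
    batches = -(-total_urls // urls_per_batch)  # ceiling division
    return batches * price_per_batch


print(total_cost_usd(TOTAL_URLS))  # 10,000 batches at US$0.25 each
```

Billing partial batches in full is an assumption of this sketch, not something the pricing statement above confirms; it makes no difference for an even million URLs.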
The takedown is part of a wider trend: companies that control access to large-scale databases about the WWW are gradually phasing out the information systems that used to give us a glimpse into the structure of the web itself.
Google, for instance, took down its link: operator years ago and, in order to fight the link-buying industry (which was sparked by PageRank itself), started displaying a mock PageRank indicator that was either very outdated or outright misleading. Blekko used to allow searching for certain specific metadata, but the system is now offline (or has been devoured by IBM Watson). Bing still allows you to search by IP address, through the ip: operator, but it’s a cult service, mostly used by web professionals, and it might also end up being taken down without much fanfare.
Still, some technologists insist on unraveling the structure of the WWW. Common Crawl, for instance, provides a massive public database that anyone with the right tools can explore. And WebDataCommons.com has created a top sites service, based on the original Google PageRank formula (PageRank itself, by the way, was hidden by Google years ago), that allows you to investigate the top WWW sites.
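For readers who have never seen how PageRank ranks sites, here is a minimal sketch of the idea using power iteration on a tiny link graph. It uses a common normalized variant of the formula with a damping factor of 0.85; it is not WebDataCommons' actual implementation, and the function name, parameters, and example domains are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Illustrative sketch only; real services work on far larger graphs.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Made-up example graph: two sites link to a hub, and the hub links back to one.
example_links = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "hub.example": ["a.example"],
}
ranks = pagerank(example_links)
for site, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{site}: {score:.3f}")
```

In this toy graph the hub ends up on top because it collects links from both other sites, which is the intuition behind any PageRank-based top sites list.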
So, for the Alexa orphans out there, check out WebDataCommons.com and the Common Crawl websites. Given that they’re becoming a rare species, do consider donating to Common Crawl for the sake of the open WWW.