The data comes from the index released by http://www.dotnetdotcom.org/ (thanks!). It contains data for 425543 URLs. The index is dated "200904" and was downloaded on 2009-04-24.

They say:

"A Note On Crawl Quality: The downloadable index above contains a uniform sample of pages. We're trying to keep this updated as we crawl new pages, but as you can imagine it's a little bit tricky :) The data is roughly as fresh as two to three months. The oldest pages might be four months old; the newest might be weeks old. Also due to our crawling method, Our crawl is probably biased toward English speaking sites within the US. Lo siento, mi amigos ;-)"

All URLs had HTTP 200 responses.

Content-Type headers (ignoring case and parameters) were:

  424422  text/html
    1227  text/xml
     911  text/plain
      81  text/vnd.wap.wml
      56  binary/octet-stream
     250  (everything else)

(410385 URLs had at least one Content-Type. Some had up to 26.)
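
As an illustration of what "ignoring case and parameters" means for the Content-Type counts, here is a minimal Python sketch. It is not the script used to produce the numbers above; the function names and the sample data are hypothetical. The text does not say whether the counts were taken per header or per URL, so this sketch assumes every header value is counted.

```python
from collections import Counter


def normalize_content_type(value):
    """Return the media type lower-cased, with any parameters removed.

    Example: "Text/HTML; charset=UTF-8" becomes "text/html".
    """
    return value.split(";", 1)[0].strip().lower()


def tally_content_types(headers_per_url):
    """Count normalized Content-Type values across all URLs.

    headers_per_url is an iterable of lists: one list of raw
    Content-Type header values per URL. A URL may have zero, one,
    or several such headers. Assumption: every header value counts.
    """
    counts = Counter()
    for values in headers_per_url:
        for value in values:
            media_type = normalize_content_type(value)
            if media_type:  # skip empty header values
                counts[media_type] += 1
    return counts


# Made-up sample data, only to exercise the functions.
sample = [
    ["text/html; charset=UTF-8"],
    ["TEXT/HTML"],
    ["text/xml;charset=iso-8859-1", "text/html"],
    [],  # a URL with no Content-Type header at all
]
counts = tally_content_types(sample)

for media_type, n in counts.most_common():
    print(f"{n:7d}  {media_type}")
```

Counting every header value is one way a per-type tally could add up to more than the number of URLs, since a response with several Content-Type headers contributes several times.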