HTML usage data

This is an index of most of the web usage data I have collected, sorted roughly from newest to oldest:

Data from 125K dmoz pages:
1. Doctypes, and the same for the Alexa Top 500.
Data from 15K dmoz pages:
Data from 8K dmoz pages:
1. Full tag/attribute data: counts and lists of pages for all tags and attributes. Also some notes on tokeniser parse errors, doctypes, and duplicate attributes.
2. Similar data from the Alexa Top 500.
Data from a couple of thousand pages from Yahoo search results: