This is an index of most of the web usage data I have collected, sorted roughly by ascending age:
- Data from 125K dmoz pages:
- Doctypes. | Same for the Alexa Top 500.
- Data from 15K dmoz pages:
- abbr, acronym titles and contents.
- URIs containing spaces.
- <meta http-equiv="..."> values.
- <meta http-equiv="content-type" content="..."> values.
- Doctypes, grouped by whether IE probably thinks they're standards/quirks. | Same but ignoring case and whitespace.
- Things that look like XML PIs.
- Errors reported by the Validator.nu parser.
- Other things vs meta generator.
- <u> vs meta generator.
- Things that look like IE conditional comments.
- <link method> values.
- URI values containing brace characters.
- Common attribute values, for a certain list of tag/attribute names.
- Data from 8K dmoz pages:
- Full tag/attribute data: counts and lists of pages for all tags and attributes. Also some bits about tokeniser parse errors, doctypes, and duplicate attributes.
- Similar data from the Alexa Top 500.
- Data from a couple of thousand pages from Yahoo search results:
- Tokeniser state transition frequencies.
- Attribute frequencies.
- Frequencies of number of attributes per tag.
- Attribute value lengths.
- Start tag frequencies.
- End tag frequencies.
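
As a rough illustration of how counts like the last few in this list could be gathered, here is a minimal sketch using Python's standard `html.parser` module. It is not the tooling behind the data indexed here, which this page does not describe. The names `TagStats` and `collect_stats` are hypothetical, and `html.parser` is not an HTML5-conformant tokeniser, so its numbers may differ from those of a spec-compliant parser.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStats(HTMLParser):
    """Hypothetical helper: accumulates simple tag/attribute frequencies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.start_tags = Counter()     # tag name -> number of start tags
        self.end_tags = Counter()       # tag name -> number of end tags
        self.attributes = Counter()     # attribute name -> occurrences
        self.attrs_per_tag = Counter()  # attribute count -> number of start tags
        self.value_lengths = Counter()  # value length -> number of attributes

    def handle_starttag(self, tag, attrs):
        self.start_tags[tag] += 1
        self.attrs_per_tag[len(attrs)] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            # value is None when the attribute is written without a value
            self.value_lengths[len(value or "")] += 1

    def handle_startendtag(self, tag, attrs):
        # Count <br/> as a start tag only; the default implementation
        # would also report a matching end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.end_tags[tag] += 1


def collect_stats(pages):
    """Feed each page (a string of HTML) through one shared TagStats."""
    stats = TagStats()
    for html in pages:
        stats.feed(html)
        stats.close()
        stats.reset()  # clears parser state; the counters are kept
    return stats


if __name__ == "__main__":
    sample = ['<p class="a" id="x">hi<br/></p>', "<input disabled>"]
    result = collect_stats(sample)
    print(result.start_tags.most_common())
    print(result.attrs_per_tag.most_common())
```

Keeping the counters on the parser object and resetting only the parser state between pages means totals accumulate across the whole page set, which is the shape of the frequency tables listed above.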