This is an index of most of the web usage data I have collected, sorted roughly by ascending age:

  1. Data from 125K dmoz pages:
    1. Doctypes. | Same for the Alexa Top 500.
  2. Data from 15K dmoz pages:
    1. abbr, acronym titles and contents.
    2. URIs containing spaces.
    3. <meta http-equiv="..."> values.
    4. <meta http-equiv="content-type" content="..."> values.
    5. Doctypes, grouped by whether IE probably thinks they're standards/quirks. | Same but ignoring case and whitespace.
    6. Things that look like XML PIs.
    7. Errors reported by the parser.
    8. Other things vs meta generator.
    9. <u> vs meta generator.
    10. Things that look like IE conditional comments.
    11. <link method> values.
    12. URI values containing brace characters.
    13. Common attribute values, for a certain list of tag/attribute names.
  3. Data from 8K dmoz pages:
    1. Full tag/attribute data: counts and lists of pages for all tags and attributes. Also some bits about tokeniser parse errors, doctypes, and duplicate attributes.
    2. Similar data from the Alexa Top 500.
  4. Data from a couple of thousand pages from Yahoo search results:
    1. Tokeniser state transition frequencies.
    2. Attribute frequencies.
    3. Frequencies of number of attributes per tag.
    4. Attribute value lengths.
    5. Start tag frequencies. End tag frequencies.
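
As a rough illustration of how counts like the tag and attribute frequencies listed above might be gathered, here is a minimal sketch. It assumes Python's standard `html.parser` module, not the tokeniser actually used to produce this data. The names `UsageCounter` and `count_usage` are hypothetical, and so is the sample markup in the usage note.

```python
from collections import Counter
from html.parser import HTMLParser


class UsageCounter(HTMLParser):
    """Hypothetical helper: tallies tags and attributes across pages."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.start_tags = Counter()     # start tag name -> occurrences
        self.end_tags = Counter()       # end tag name -> occurrences
        self.attributes = Counter()     # attribute name -> occurrences
        self.attrs_per_tag = Counter()  # attribute count -> number of start tags

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        self.start_tags[tag] += 1
        self.attrs_per_tag[len(attrs)] += 1
        for name, _value in attrs:
            self.attributes[name] += 1

    def handle_endtag(self, tag):
        self.end_tags[tag] += 1


def count_usage(pages):
    """Feed each page's markup through one counter and return it."""
    counter = UsageCounter()
    for html in pages:
        counter.feed(html)
        counter.close()
        counter.reset()  # clears parser state only; the Counters persist
    return counter
```

For example, `count_usage(['<p class="a" id="x">Hi <u>there</u></p>', '<p>Bye<br></p>'])` yields a counter whose `start_tags` holds two `p`, one `u`, and one `br`. Because `html.parser` does not follow the HTML5 tokeniser rules, its results on malformed markup would likely differ from the data indexed here.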