The source data is 131,072 randomly selected HTTP URIs from the Open Directory Project, downloaded on 2008-02-26. (The sample is obviously biased in many directions, and less obviously biased in many more, so interpret the data with care.)
In the table below, HTTP is the charset from the HTTP Content-Type header, extracted using the "algorithm for extracting an encoding from a Content-Type" from HTML5 as of 2008-03-05.
meta content is from the first <meta ... content="...">.
meta charset is from the first <meta ... charset="...">.
Sniffer is the "encoding sniffing algorithm" run on the entire document.
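As a rough illustration of what extracting a charset from a Content-Type value involves, here is a minimal Python sketch. It is a simplified regex approximation, not the HTML5 algorithm used for this survey, so its results would not necessarily match the labels counted in the table. The name `extract_charset` is hypothetical. The same helper could be applied to the value of a `<meta ... content="...">` attribute.

```python
import re

# Hypothetical, simplified pattern: "charset", optional whitespace, "=",
# optional whitespace, then a double-quoted, single-quoted, or bare value.
_CHARSET_RE = re.compile(
    r'charset\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s;"\']+))',
    re.IGNORECASE,
)


def extract_charset(content_type):
    """Return the lowercased charset label, or None if none is found."""
    match = _CHARSET_RE.search(content_type)
    if match is None:
        return None
    # Exactly one of the three groups participates in a successful match.
    value = next(g for g in match.groups() if g is not None)
    # Lowercase so labels compare equal regardless of case; treat an
    # empty quoted value as "no charset".
    return value.strip().lower() or None
```

Unlike this sketch, which stops at a semicolon or whitespace, a stricter or looser reading of the header will produce different labels for malformed values.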
Decoding was done with ICU4J 3.8.1. Parsing was done with the Validator.nu HTML parser 1.0.6.
Strings were lowercased before comparison.
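The "% decoded without errors" figures depend on attempting a strict decode with the declared label. The sketch below shows the idea using Python's built-in codecs as a stand-in for ICU4J. That substitution is an assumption: the two libraries recognise different sets of labels and differ in strictness, so this would not reproduce the survey's numbers. The name `decodes_cleanly` is hypothetical.

```python
import codecs


def decodes_cleanly(data, label):
    """Try a strict decode of `data` using the declared charset `label`.

    Returns True if decoding succeeds, False if it raises a decode error,
    and None if Python does not recognise the label at all.
    """
    try:
        # Labels are lowercased first, mirroring the comparison rule above.
        codec = codecs.lookup(label.strip().lower())
    except LookupError:
        return None
    try:
        codec.decode(data, "strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the byte `0xE9` on its own is valid ISO-8859-1 but not valid UTF-8, so the same page can pass or fail depending on which label it declares.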
Each list of pages shows at most 100 URIs; if a list is larger than that, the URIs with the lowest MD5s are chosen.
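Selecting by lowest MD5 gives a sample that is repeatable and does not depend on crawl order. A minimal sketch follows, assuming the hash is taken over the UTF-8 bytes of the URI string and compared as hex digests (the source does not say exactly what was hashed). The name `sample_by_md5` is hypothetical.

```python
import hashlib
import heapq


def sample_by_md5(uris, limit=100):
    """Return at most `limit` URIs, preferring those with the lowest MD5."""

    def md5_hex(uri):
        # Assumption: hash the UTF-8 encoding of the URI string.
        return hashlib.md5(uri.encode("utf-8")).hexdigest()

    uris = list(uris)
    if len(uris) <= limit:
        # Short lists are kept whole, in their original order.
        return uris
    # Equivalent to sorted(uris, key=md5_hex)[:limit].
    return heapq.nsmallest(limit, uris, key=md5_hex)
```

Because the hash is effectively random with respect to the URI's content, this behaves like a random sample while remaining deterministic.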
| Successfully downloaded text/html pages | 126989 |
|---|---|
| Bytes used | Pages with charset sniffed | 
|---|---|
| (all) | 89695 | 
| 256 | 51513 (57%) | 
| 512 | 73534 (82%) | 
| 768 | 79101 (88%) | 
| 1024 | 82692 (92%) | 
| 1536 | 86192 (96%) | 
| 2048 | 87714 (98%) | 
| 4096 | 88971 (99%) | 
Also see detailed graph.
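The percentages in the "Bytes used" table are each row's count relative to the "(all)" row. The sketch below reproduces that arithmetic and shows one way such counts could be gathered. The names `pages_sniffed_within` and `percent_of`, and the `sniff` callable (any function returning an encoding label or `None`), are hypothetical and not the tooling used for the survey.

```python
def pages_sniffed_within(pages, limit, sniff):
    """Count pages where `sniff` finds an encoding in the first `limit` bytes.

    `pages` is an iterable of bytes objects; `limit=None` means the whole
    document is passed to `sniff`.
    """
    count = 0
    for body in pages:
        prefix = body if limit is None else body[:limit]
        if sniff(prefix) is not None:
            count += 1
    return count


def percent_of(count, total):
    """Percentage of `total`, rounded to the nearest whole number."""
    return round(100 * count / total)


# Counts copied from the table above.
TOTAL_SNIFFED = 89695
SNIFFED = {256: 51513, 512: 73534, 768: 79101, 1024: 82692,
           1536: 86192, 2048: 87714, 4096: 88971}

for limit, count in SNIFFED.items():
    print(limit, count, f"({percent_of(count, TOTAL_SNIFFED)}%)")
```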
Number of pages declaring encoding (% decoded without errors):

| Charset | HTTP | meta content | Sniffer | meta charset | |
|---|---|---|---|---|---|
| iso-8859-1 | 10320 (100%) | 53318 (100%) | 53365 (100%) | 66 (100%) | |
| utf-8 | 12253 (96%) | 11998 (96%) | 12006 (96%) | 8 (88%) | |
| windows-1252 | 119 (100%) | 11606 (100%) | 11618 (100%) | 4 (100%) | |
| shift_jis | 85 (100%) | 3587 (99%) | 3592 (99%) | 6 (100%) | |
| iso-8859-2 | 269 (100%) | 1909 (100%) | 1914 (100%) | 9 (100%) | |
| windows-1251 | 991 (100%) | 1606 (100%) | 1616 (100%) | 9 (100%) | |
| windows-1250 | 48 (100%) | 958 (100%) | 962 (100%) | 4 (100%) | |
| gb2312 | 68 (84%) | 755 (84%) | 759 (84%) | 2 (50%) | |
| iso-8859-15 | 151 (100%) | 330 (100%) | 333 (100%) | 1 (100%) | |
| us-ascii | 179 (99%) | 343 (95%) | 343 (95%) | 0 | |
| windows-1254 | 12 (100%) | 355 (100%) | 354 (100%) | 0 | |
| big5 | 47 (94%) | 343 (99%) | 344 (99%) | 1 (100%) | |
| iso-8859-9 | 65 (100%) | 306 (100%) | 307 (100%) | 0 | |
| x-sjis | 1 (100%) | 331 (97%) | 331 (97%) | 0 | |
| euc-jp | 119 (86%) | 294 (90%) | 294 (90%) | 0 | |
| iso8859-1 | 22 (100%) | 144 (100%) | 146 (100%) | 2 (100%) | |
| windows-1255 | 21 (100%) | 152 (100%) | 156 (100%) | 4 (100%) | |
| U | 23 (0%) | 129 (0%) | 128 (1%) | 0 | |
| euc-kr | 17 (94%) | 136 (98%) | 138 (98%) | 2 (100%) | |
| windows-1257 | 14 (100%) | 134 (100%) | 135 (100%) | 0 | |
| windows-1256 | 19 (100%) | 135 (100%) | 135 (100%) | 0 | |
| koi8-r | 85 (100%) | 39 (100%) | 39 (100%) | 0 | |
| none | U | 92 (0%) | 0 | 0 | 0 | 
| iso-8859-7 | 14 (100%) | 68 (100%) | 68 (100%) | 0 | |
| windows-1253 | 3 (100%) | 76 (100%) | 76 (100%) | 0 | |
| windows-874 | 4 (100%) | 58 (100%) | 58 (100%) | 0 | |
| x-windows-874 | U | 0 | 0 | 0 | 0 | 
| windows-1252; | 54 (100%) | 0 | 0 | 0 | |
| utf-8; | 33 (100%) | 12 (92%) | 12 (92%) | 0 | |
| iso-8559-1 | U | 2 (0%) | 30 (0%) | 30 (0%) | 0 | 
| tis-620 | 13 (69%) | 22 (68%) | 22 (68%) | 0 | |
| iso-2022-jp | U | 3 (0%) | 28 (0%) | 28 (0%) | 0 | 
| iso-8859-8 | 1 (100%) | 17 (94%) | 17 (94%) | 0 | |
| iso-8859-1" | 0 | 0 | 24 (100%) | 26 (100%) | |
| iso-8859-1; | 7 (100%) | 17 (100%) | 17 (100%) | 0 | |
| utf8 | 10 (100%) | 12 (83%) | 12 (83%) | 0 | |
| gbk | 10 (100%) | 15 (100%) | 15 (100%) | 0 | |
| unicode | 0 | 18 (56%) | 18 (56%) | 0 | |
| cp1251 | 15 (100%) | 1 (100%) | 1 (100%) | 0 | |
| latin1 | 12 (100%) | 4 (100%) | 4 (100%) | 0 | |
| utf-16 | 0 | 16 (50%) | 16 (50%) | 0 | |
| x-mac-roman | U | 0 | 15 (0%) | 15 (0%) | 0 | 
| shift-jis | 2 (100%) | 11 (100%) | 12 (100%) | 1 (100%) | |
| cp1252 | 12 (100%) | 0 | 0 | 0 | |
| en | U | 11 (0%) | 1 (0%) | 1 (0%) | 0 | 
| macintosh | 0 | 12 (100%) | 12 (100%) | 0 | |
| iso-8859-8-i | 3 (100%) | 10 (100%) | 10 (100%) | 0 | |
| x-euc-jp | 0 | 10 (100%) | 10 (100%) | 0 | |
| ks_c_5601-1987 | 0 | 9 (100%) | 9 (100%) | 0 | |
| .utf8 | 8 (88%) | 0 | 0 | 0 | |
| iso | U | 0 | 8 (0%) | 8 (0%) | 0 | 
| cp-1251 | 6 (100%) | 1 (100%) | 1 (100%) | 0 | |
| iso-8859-5 | 0 | 7 (100%) | 7 (100%) | 0 | |
| 0 | U | 6 (0%) | 0 | 0 | 0 | 
| bs_4730 | U | 6 (0%) | 0 | 0 | 0 | 
| iso- | U | 0 | 6 (0%) | 6 (0%) | 0 | 
| iso-8859-1, | 1 (100%) | 6 (100%) | 6 (100%) | 0 | |
| _charset | U | 1 (0%) | 4 (0%) | 4 (0%) | 0 | 
| ascii | 0 | 5 (80%) | 5 (80%) | 0 | |
| euc_kr | 5 (100%) | 0 | 0 | 0 | |
| iso-8859-4 | 1 (100%) | 5 (100%) | 5 (100%) | 0 | |
| utf-8" | 1 (100%) | 1 (100%) | 3 (100%) | 3 (67%) | |
| visual | U | 0 | 3 (0%) | 3 (0%) | 2 (0%) | 
| windows-31j | 4 (100%) | 1 (100%) | 1 (100%) | 0 | |
| charset=iso-8859-1 | U | 0 | 4 (0%) | 4 (0%) | 0 | 
| iso-8859 | U | 1 (0%) | 3 (0%) | 3 (0%) | 0 | 
| iso-8859-13 | 0 | 3 (100%) | 3 (100%) | 0 | |
| iso8859-2 | 2 (100%) | 2 (100%) | 2 (100%) | 0 | |
| iso8859_1 | 3 (100%) | 2 (100%) | 2 (100%) | 0 | |
| iso_8859-1 | 4 (100%) | 0 | 0 | 0 | |
| ucs-2 | 0 | 4 (50%) | 4 (50%) | 0 | |
| x-user-defined | U | 0 | 4 (0%) | 4 (0%) | 1 (0%) | 
| charset=utf-8 | U | 0 | 3 (0%) | 3 (0%) | 0 | 
| euc_jp | 2 (100%) | 1 (100%) | 1 (100%) | 0 | |
| iso-8859-3 | 0 | 3 (67%) | 3 (67%) | 0 | |
| iso.8859-1 | 3 (100%) | 0 | 0 | 0 | |
| iso8859-9 | 1 (100%) | 2 (100%) | 2 (100%) | 0 | |
| latin-1 | 3 (100%) | 1 (100%) | 1 (100%) | 0 | |
| win-1251 | U | 3 (0%) | 0 | 0 | 0 | 
| windows-1250" | 1 (100%) | 0 | 2 (100%) | 2 (100%) | |
| windows-1252" | 0 | 0 | 2 (100%) | 3 (100%) | |
| windows1254 | 0 | 3 (100%) | 3 (100%) | 0 | |
| 0ff | U | 2 (0%) | 0 | 0 | 0 | 
| 8859-1 | 0 | 2 (100%) | 2 (100%) | 0 | |
| 8859_1 | 2 (100%) | 0 | 0 | 0 | |
| <$mtpublishcharset$> | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| ansi | U | 2 (0%) | 0 | 0 | 0 | 
| big5,euc-jp | U | 2 (0%) | 0 | 0 | 0 | 
| charset | U | 2 (0%) | 0 | 0 | 0 | 
| charset=windows-1251 | U | 2 (0%) | 0 | 0 | 0 | 
| en_us | U | 2 (0%) | 0 | 0 | 0 | 
| es | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| euc | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 | 
| is0-8859-1 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 | 
| iso-10646 | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| iso-8 | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| iso-8559-2 | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| iso-8829-2 | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| iso-8859-1> | 0 | 2 (100%) | 2 (100%) | 0 | |
| iso-88591 | 0 | 2 (100%) | 2 (100%) | 0 | |
| no | U | 2 (0%) | 0 | 0 | 0 | 
| null | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| uft-8 | U | 0 | 2 (0%) | 2 (0%) | 0 | 
| utf-32 | 1 (0%) | 1 (0%) | 1 (0%) | 0 | |
| utf-9 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 | 
| win | U | 2 (0%) | 0 | 0 | 0 | 
| win-1250 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 | 
| win-1252 | U | 1 (0%) | 0 | 1 (0%) | 1 (0%) | 
| windows | U | 2 (0%) | 0 | 0 | 0 | 
| %charset% | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| , | U | 1 (0%) | 0 | 0 | 0 | 
| .iso-8859-1 | 1 (100%) | 0 | 0 | 0 | |
| 0,-1 | U | 1 (0%) | 0 | 0 | 0 | 
| 10646 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| 1250 | U | 1 (0%) | 0 | 0 | 0 | 
| 1254 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| <% | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| U | 0 | 1 (0%) | 1 (0%) | 0 | |
| _iso | U | 1 (0%) | 0 | 0 | 0 | 
| armscii-8 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| auto | U | 1 (0%) | 0 | 0 | 0 | 
| beagle kennel van der liniehoeve | U | 0 | 0 | 0 | 1 (0%) | 
| big-5 | 1 (100%) | 0 | 0 | 0 | |
| big5; | 0 | 1 (100%) | 1 (100%) | 0 | |
| cp1250 | 0 | 1 (100%) | 1 (100%) | 0 | |
| cp1256 | 0 | 1 (100%) | 1 (100%) | 0 | |
| cp852 | 0 | 1 (100%) | 1 (100%) | 0 | |
| de_de@euro | U | 1 (0%) | 0 | 0 | 0 | 
| en-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| en-us | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| enu-kr | U | 1 (0%) | 0 | 0 | 0 | 
| es_es.utf8 | U | 1 (0%) | 0 | 0 | 0 | 
| euc-2 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| euckr | 1 (100%) | 0 | 0 | 0 | |
| gb-2312 | 1 (100%) | 0 | 0 | 0 | |
| gb2312" | 0 | 0 | 1 (0%) | 1 (0%) | |
| gb2312-80 | 0 | 1 (0%) | 1 (0%) | 0 | |
| gb2312;charset=iso-8859-1 | U | 1 (0%) | 0 | 0 | 0 | 
| gb2312\" | 0 | 0 | 1 (100%) | 1 (100%) | |
| greek | 0 | 1 (100%) | 1 (100%) | 0 | |
| iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| ibm852 | 0 | 0 | 0 | 0 | |
| ico-8859-1 | U | 1 (0%) | 0 | 0 | 0 | 
| is0-8859-2 | U | 1 (0%) | 0 | 0 | 0 | 
| iso-10646-ucs-2 | 0 | 1 (0%) | 1 (0%) | 0 | |
| iso-10646-utf-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-1250 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-202059-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-2022 | U | 1 (0%) | 0 | 0 | 0 | 
| iso-2022-kr | U | 1 (0%) | 0 | 0 | 0 | 
| iso-5589-2 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-8759-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-88 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-8840-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-8850-15/ | U | 0 | 0 | 1 (0%) | 1 (0%) | 
| iso-8851-1 | U | 1 (0%) | 0 | 0 | 0 | 
| iso-8859- | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-8859-1"/ | 0 | 0 | 0 | 1 (100%) | |
| iso-8859-13" | 0 | 0 | 1 (100%) | 1 (100%) | |
| iso-8859-15; | 1 (100%) | 0 | 0 | 0 | |
| iso-8859-1;pageencoding=iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-8859-1s | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-8859-2" | 0 | 0 | 1 (100%) | 1 (100%) | |
| iso-8859-2; | 1 (100%) | 0 | 0 | 0 | |
| iso-8859-2> | 0 | 1 (100%) | 1 (100%) | 0 | |
| iso-8859-6 | 1 (100%) | 1 (100%) | 1 (100%) | 0 | |
| iso-8859-9" | 0 | 0 | 1 (100%) | 1 (100%) | |
| iso-8859-l | U | 0 | 0 | 1 (0%) | 1 (0%) | 
| iso-8895-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-99-59-1 | U | 0 | 0 | 1 (0%) | 1 (0%) | 
| iso-9959-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso-utf-8 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| iso.8859-15 | 1 (100%) | 0 | 0 | 0 | |
| iso8859-15 | 0 | 1 (100%) | 1 (100%) | 0 | |
| iso8859-7 | 0 | 1 (100%) | 1 (100%) | 0 | |
| iso88591 | 0 | 1 (100%) | 1 (100%) | 0 | |
| iso_8859-15 | 1 (100%) | 0 | 0 | 0 | |
| iso_8859-9 | 0 | 1 (100%) | 1 (100%) | 0 | |
| iso_8859_1 | 0 | 1 (100%) | 1 (100%) | 0 | |
| it_it.iso8859-15 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| koi8-u | 1 (100%) | 1 (100%) | 1 (100%) | 0 | |
| ksc5601 | 1 (100%) | 0 | 0 | 0 | |
| langtagcharsetiso | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| latin | U | 1 (0%) | 0 | 0 | 0 | 
| latin2 | 0 | 1 (100%) | 1 (100%) | 0 | |
| ms932 | 1 (100%) | 0 | 0 | 0 | |
| nl_nl.iso8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| pl | U | 1 (0%) | 0 | 0 | 0 | 
| pt-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| shft_jis | U | 1 (0%) | 0 | 0 | 0 | 
| shift_jis; | 0 | 0 | 1 (100%) | 1 (100%) | |
| sjis | 1 (100%) | 0 | 0 | 0 | |
| unicode-1-1-utf-8 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| user-defined | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| utf-8"" | 0 | 0 | 1 (100%) | 1 (100%) | |
| utf8_czech_cs | U | 1 (0%) | 0 | 0 | 0 | 
| western | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| white | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| widows-1250 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| win1251 | U | 1 (0%) | 0 | 0 | 0 | 
| window-874 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-1250; | 0 | 1 (100%) | 1 (100%) | 0 | |
| windows-1251" | 0 | 0 | 1 (100%) | 1 (100%) | |
| windows-1252' | 0 | 1 (100%) | 1 (100%) | 0 | |
| windows-1252romance | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-1255; | 0 | 1 (100%) | 1 (100%) | 0 | |
| windows-2252 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-8859 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-8859-2 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-8859-2" | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| windows-932 | 1 (100%) | 0 | 0 | 0 | |
| x-mac-thai | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| x-mac-turkish | 0 | 0 | 0 | 1 (100%) | |
| x-x-big5 | U | 0 | 1 (0%) | 1 (0%) | 0 | 
| {charset} | U | 0 | 1 (0%) | 1 (0%) | 0 |