The source data is 131,072 randomly selected HTTP URIs from the Open Directory Project, downloaded on 2008-02-26. (There is obvious bias in many directions, and less obvious bias in many more, so be careful when interpreting the data.)
In the table below, HTTP is the charset from the HTTP Content-Type header, using the "algorithm for extracting an encoding from a Content-Type" from HTML5 as of 2008-03-05.
meta content is from the first `<meta ... content="...">`.
meta charset is from the first `<meta ... charset="...">`.
Sniffer is the "encoding sniffing algorithm" run on the first 1024 bytes of data.
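
As a rough illustration of what pulling a charset label out of a Content-Type value involves, here is a minimal Python sketch. It is a simplification and not the HTML5 algorithm referenced above: the function name `charset_from_content_type` and the regular expression are assumptions made for this example. In particular it stops at a semicolon and strips matching quotes, so it will not necessarily reproduce labels such as `utf-8;` that appear in the table.

```python
import re

# Hypothetical, simplified pattern: "charset", "=", then a quoted or bare value.
# This is NOT the HTML5 extraction algorithm, only an approximation of the idea.
_CHARSET_RE = re.compile(
    r"""charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;]+))""",
    re.IGNORECASE,
)


def charset_from_content_type(value):
    """Return the lowercased charset label found in a Content-Type value, or None."""
    match = _CHARSET_RE.search(value)
    if match is None:
        return None
    # Exactly one of the three alternatives took part in the match.
    label = next(g for g in match.groups() if g is not None)
    return label.strip().lower()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type('text/html; charset="ISO-8859-1"'))
print(charset_from_content_type("text/html"))
```

The label is lowercased here because the comparison in this data set is case-insensitive.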
Decoding was done with ICU4J 3.8.1. Parsing and encoding sniffing were done with the Validator.nu HTML parser 1.0.6, which matches the spec as of June 2007; the name recorded is the canonical name, not necessarily the detected string.
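
The "% decoded without errors" figures come from attempting a strict decode of each page with its declared charset. The sketch below shows the same idea using Python's built-in codecs instead of ICU4J, so the set of recognised labels and the strictness of individual decoders will differ from the numbers reported here. The function name `decodes_without_errors` is hypothetical.

```python
def decodes_without_errors(data, charset):
    """Try a strict decode of `data` using the declared `charset`.

    Returns True or False when Python has a codec for the label, and
    None when the label is unknown. Python's codecs stand in for ICU4J
    here, which is an assumption of this sketch.
    """
    try:
        data.decode(charset, errors="strict")
    except LookupError:
        return None  # no codec registered under this label
    except UnicodeDecodeError:
        return False  # at least one invalid byte sequence
    return True


print(decodes_without_errors(b"caf\xc3\xa9", "utf-8"))
print(decodes_without_errors(b"caf\xe9", "utf-8"))
print(decodes_without_errors(b"caf\xe9", "iso-8859-1"))
```

Single-byte encodings such as iso-8859-1 accept every byte value, which is one reason a 100% success rate for them says little about whether the declaration was correct.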
Strings were lowercased before comparison.
The lists of pages show up to 100 URIs each; if a list is too large, the URIs with the lowest MD5s are chosen.
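
Picking the entries with the lowest MD5s gives a sample that is stable across runs and independent of input order. A minimal sketch follows; it assumes the hash is taken over the UTF-8 bytes of the URI string and compared as a hex digest, neither of which is stated in the text. The name `lowest_md5_sample` is hypothetical.

```python
import hashlib


def lowest_md5_sample(uris, limit=100):
    """Return at most `limit` URIs, preferring those with the lowest MD5 digest.

    Assumption: the digest is computed over the UTF-8 encoding of the URI.
    """
    def digest(uri):
        return hashlib.md5(uri.encode("utf-8")).hexdigest()

    return sorted(uris, key=digest)[:limit]


uris = ["http://example.com/%d" % i for i in range(250)]
sample = lowest_md5_sample(uris)
print(len(sample))
```

Because the ordering depends only on each URI's own digest, adding or removing unrelated URIs changes the sample as little as possible.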
Successfully downloaded text/html pages | 126989 |
---|---|

Bytes used | Pages with charset sniffed |
---|---|
(all) | 89013 |
256 | 51043 |
512 | 73037 |
1024 | 82082 |
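
The counts above grow with the number of bytes examined because a declaration that starts after the cut-off cannot be seen. The sketch below illustrates that effect with a deliberately crude regular-expression scan for a charset inside a `<meta>` tag. It is not the HTML5 encoding sniffing algorithm (it ignores byte order marks, comments and attribute parsing rules), and the name `sniff_prefix` is hypothetical.

```python
import re

# Crude stand-in for a real prescan: find "charset=<label>" inside a <meta> tag.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_prefix(data, limit=1024):
    """Look for a meta charset declaration in the first `limit` bytes only."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# A page whose declaration begins 312 bytes in: missed at 256, found at 512.
page = (
    b"<html><head>" + b" " * 300
    + b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
)
print(sniff_prefix(page, 256))
print(sniff_prefix(page, 512))
```
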

Number of pages declaring encoding (% decoded without errors)

Charset | | HTTP | meta content | Sniffer | meta charset |
---|---|---|---|---|---|
iso-8859-1 | | 10320 (100%) | 53318 (100%) | 53614 (100%) | 66 (100%) |
utf-8 | | 12253 (96%) | 11998 (96%) | 12053 (96%) | 8 (88%) |
windows-1252 | | 119 (100%) | 11606 (100%) | 11628 (100%) | 4 (100%) |
shift_jis | | 85 (100%) | 3587 (99%) | 3937 (99%) | 6 (100%) |
iso-8859-2 | | 269 (100%) | 1909 (100%) | 1922 (100%) | 9 (100%) |
windows-1251 | | 991 (100%) | 1606 (100%) | 1177 (100%) | 9 (100%) |
windows-1250 | | 48 (100%) | 958 (100%) | 968 (100%) | 4 (100%) |
gb2312 | | 68 (84%) | 755 (84%) | 762 (84%) | 2 (50%) |
iso-8859-15 | | 151 (100%) | 330 (100%) | 335 (100%) | 1 (100%) |
us-ascii | | 179 (99%) | 343 (95%) | 349 (95%) | 0 |
windows-1254 | | 12 (100%) | 355 (100%) | 359 (100%) | 0 |
big5 | | 47 (94%) | 343 (99%) | 346 (99%) | 1 (100%) |
iso-8859-9 | | 65 (100%) | 306 (100%) | 312 (100%) | 0 |
x-sjis | | 1 (100%) | 331 (97%) | 0 | 0 |
euc-jp | | 119 (86%) | 294 (90%) | 305 (90%) | 0 |
iso8859-1 | | 22 (100%) | 144 (100%) | 0 | 2 (100%) |
windows-1255 | | 21 (100%) | 152 (100%) | 157 (100%) | 4 (100%) |
| | U | 23 (0%) | 129 (0%) | 0 | 0 |
euc-kr | | 17 (94%) | 136 (98%) | 147 (98%) | 2 (100%) |
windows-1257 | | 14 (100%) | 134 (100%) | 136 (100%) | 0 |
windows-1256 | | 19 (100%) | 135 (100%) | 136 (100%) | 0 |
koi8-r | | 85 (100%) | 39 (100%) | 39 (100%) | 0 |
none | U | 92 (0%) | 0 | 0 | 0 |
iso-8859-7 | | 14 (100%) | 68 (100%) | 70 (100%) | 0 |
windows-1253 | | 3 (100%) | 76 (100%) | 76 (100%) | 0 |
windows-874 | | 4 (100%) | 58 (100%) | 0 | 0 |
x-windows-874 | U | 0 | 0 | 58 (0%) | 0 |
windows-1252; | | 54 (100%) | 0 | 0 | 0 |
utf-8; | | 33 (100%) | 12 (92%) | 0 | 0 |
iso-8559-1 | U | 2 (0%) | 30 (0%) | 0 | 0 |
tis-620 | | 13 (69%) | 22 (68%) | 22 (68%) | 0 |
iso-2022-jp | U | 3 (0%) | 28 (0%) | 28 (0%) | 0 |
iso-8859-8 | | 1 (100%) | 17 (94%) | 27 (96%) | 0 |
iso-8859-1" | | 0 | 0 | 0 | 26 (100%) |
iso-8859-1; | | 7 (100%) | 17 (100%) | 0 | 0 |
utf8 | | 10 (100%) | 12 (83%) | 0 | 0 |
gbk | | 10 (100%) | 15 (100%) | 15 (100%) | 0 |
unicode | | 0 | 18 (56%) | 0 | 0 |
cp1251 | | 15 (100%) | 1 (100%) | 0 | 0 |
latin1 | | 12 (100%) | 4 (100%) | 0 | 0 |
utf-16 | | 0 | 16 (50%) | 0 | 0 |
x-mac-roman | U | 0 | 15 (0%) | 0 | 0 |
shift-jis | | 2 (100%) | 11 (100%) | 0 | 1 (100%) |
cp1252 | | 12 (100%) | 0 | 0 | 0 |
en | U | 11 (0%) | 1 (0%) | 0 | 0 |
macintosh | | 0 | 12 (100%) | 12 (100%) | 0 |
iso-8859-8-i | | 3 (100%) | 10 (100%) | 0 | 0 |
x-euc-jp | | 0 | 10 (100%) | 0 | 0 |
ks_c_5601-1987 | | 0 | 9 (100%) | 0 | 0 |
.utf8 | | 8 (88%) | 0 | 0 | 0 |
iso | U | 0 | 8 (0%) | 0 | 0 |
cp-1251 | | 6 (100%) | 1 (100%) | 0 | 0 |
iso-8859-5 | | 0 | 7 (100%) | 7 (100%) | 0 |
0 | U | 6 (0%) | 0 | 0 | 0 |
bs_4730 | U | 6 (0%) | 0 | 0 | 0 |
iso- | U | 0 | 6 (0%) | 0 | 0 |
iso-8859-1, | | 1 (100%) | 6 (100%) | 0 | 0 |
_charset | U | 1 (0%) | 4 (0%) | 0 | 0 |
ascii | | 0 | 5 (80%) | 0 | 0 |
euc_kr | | 5 (100%) | 0 | 0 | 0 |
iso-8859-4 | | 1 (100%) | 5 (100%) | 5 (100%) | 0 |
utf-8" | | 1 (100%) | 1 (100%) | 0 | 3 (67%) |
visual | U | 0 | 3 (0%) | 0 | 2 (0%) |
windows-31j | | 4 (100%) | 1 (100%) | 1 (100%) | 0 |
charset=iso-8859-1 | U | 0 | 4 (0%) | 0 | 0 |
iso-8859 | U | 1 (0%) | 3 (0%) | 0 | 0 |
iso-8859-13 | | 0 | 3 (100%) | 4 (100%) | 0 |
iso8859-2 | | 2 (100%) | 2 (100%) | 0 | 0 |
iso8859_1 | | 3 (100%) | 2 (100%) | 0 | 0 |
iso_8859-1 | | 4 (100%) | 0 | 0 | 0 |
ucs-2 | | 0 | 4 (50%) | 0 | 0 |
x-user-defined | U | 0 | 4 (0%) | 0 | 1 (0%) |
charset=utf-8 | U | 0 | 3 (0%) | 0 | 0 |
euc_jp | | 2 (100%) | 1 (100%) | 0 | 0 |
iso-8859-3 | | 0 | 3 (67%) | 3 (67%) | 0 |
iso.8859-1 | | 3 (100%) | 0 | 0 | 0 |
iso8859-9 | | 1 (100%) | 2 (100%) | 0 | 0 |
latin-1 | | 3 (100%) | 1 (100%) | 0 | 0 |
win-1251 | U | 3 (0%) | 0 | 0 | 0 |
windows-1250" | | 1 (100%) | 0 | 0 | 2 (100%) |
windows-1252" | | 0 | 0 | 0 | 3 (100%) |
windows1254 | | 0 | 3 (100%) | 0 | 0 |
0ff | U | 2 (0%) | 0 | 0 | 0 |
8859-1 | | 0 | 2 (100%) | 0 | 0 |
8859_1 | | 2 (100%) | 0 | 0 | 0 |
<$mtpublishcharset$> | U | 0 | 2 (0%) | 0 | 0 |
ansi | U | 2 (0%) | 0 | 0 | 0 |
big5,euc-jp | U | 2 (0%) | 0 | 0 | 0 |
charset | U | 2 (0%) | 0 | 0 | 0 |
charset=windows-1251 | U | 2 (0%) | 0 | 0 | 0 |
en_us | U | 2 (0%) | 0 | 0 | 0 |
es | U | 0 | 2 (0%) | 0 | 0 |
euc | U | 1 (0%) | 1 (0%) | 0 | 0 |
is0-8859-1 | U | 1 (0%) | 1 (0%) | 0 | 0 |
iso-10646 | U | 0 | 2 (0%) | 0 | 0 |
iso-8 | U | 0 | 2 (0%) | 0 | 0 |
iso-8559-2 | U | 0 | 2 (0%) | 0 | 0 |
iso-8829-2 | U | 0 | 2 (0%) | 0 | 0 |
iso-8859-1> | | 0 | 2 (100%) | 0 | 0 |
iso-88591 | | 0 | 2 (100%) | 0 | 0 |
no | U | 2 (0%) | 0 | 0 | 0 |
null | U | 0 | 2 (0%) | 0 | 0 |
uft-8 | U | 0 | 2 (0%) | 0 | 0 |
utf-32 | | 1 (0%) | 1 (0%) | 0 | 0 |
utf-9 | U | 1 (0%) | 1 (0%) | 0 | 0 |
win | U | 2 (0%) | 0 | 0 | 0 |
win-1250 | U | 1 (0%) | 1 (0%) | 0 | 0 |
win-1252 | U | 1 (0%) | 0 | 0 | 1 (0%) |
windows | U | 2 (0%) | 0 | 0 | 0 |
%charset% | U | 0 | 1 (0%) | 0 | 0 |
, | U | 1 (0%) | 0 | 0 | 0 |
.iso-8859-1 | | 1 (100%) | 0 | 0 | 0 |
0,-1 | U | 1 (0%) | 0 | 0 | 0 |
10646 | U | 0 | 1 (0%) | 0 | 0 |
1250 | U | 1 (0%) | 0 | 0 | 0 |
1254 | U | 0 | 1 (0%) | 0 | 0 |
<% | U | 0 | 1 (0%) | 0 | 0 |
| | U | 0 | 1 (0%) | 0 | 0 |
_iso | U | 1 (0%) | 0 | 0 | 0 |
armscii-8 | U | 0 | 1 (0%) | 0 | 0 |
auto | U | 1 (0%) | 0 | 0 | 0 |
beagle kennel van der liniehoeve | U | 0 | 0 | 0 | 1 (0%) |
big-5 | | 1 (100%) | 0 | 0 | 0 |
big5; | | 0 | 1 (100%) | 0 | 0 |
cp1250 | | 0 | 1 (100%) | 0 | 0 |
cp1256 | | 0 | 1 (100%) | 0 | 0 |
cp852 | | 0 | 1 (100%) | 0 | 0 |
de_de@euro | U | 1 (0%) | 0 | 0 | 0 |
en-iso-8859-1 | U | 0 | 1 (0%) | 0 | 0 |
en-us | U | 0 | 1 (0%) | 0 | 0 |
enu-kr | U | 1 (0%) | 0 | 0 | 0 |
es_es.utf8 | U | 1 (0%) | 0 | 0 | 0 |
euc-2 | U | 0 | 1 (0%) | 0 | 0 |
euckr | | 1 (100%) | 0 | 0 | 0 |
gb-2312 | | 1 (100%) | 0 | 0 | 0 |
gb2312" | | 0 | 0 | 0 | 1 (0%) |
gb2312-80 | | 0 | 1 (0%) | 0 | 0 |
gb2312;charset=iso-8859-1 | U | 1 (0%) | 0 | 0 | 0 |
gb2312\" | | 0 | 0 | 0 | 1 (100%) |
greek | | 0 | 1 (100%) | 0 | 0 |
iso-8859-1 | U | 0 | 1 (0%) | 0 | 0 |
ibm852 | | 0 | 0 | 1 (100%) | 0 |
ico-8859-1 | U | 1 (0%) | 0 | 0 | 0 |
is0-8859-2 | U | 1 (0%) | 0 | 0 | 0 |
iso-10646-ucs-2 | | 0 | 1 (0%) | 0 | 0 |
iso-10646-utf-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-1250 | U | 0 | 1 (0%) | 0 | 0 |
iso-202059-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-2022 | U | 1 (0%) | 0 | 0 | 0 |
iso-2022-kr | U | 1 (0%) | 0 | 0 | 0 |
iso-5589-2 | U | 0 | 1 (0%) | 0 | 0 |
iso-8759-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-88 | U | 0 | 1 (0%) | 0 | 0 |
iso-8840-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-8850-15/ | U | 0 | 0 | 0 | 1 (0%) |
iso-8851-1 | U | 1 (0%) | 0 | 0 | 0 |
iso-8859- | U | 0 | 1 (0%) | 0 | 0 |
iso-8859-1"/ | | 0 | 0 | 0 | 1 (100%) |
iso-8859-13" | | 0 | 0 | 0 | 1 (100%) |
iso-8859-15; | | 1 (100%) | 0 | 0 | 0 |
iso-8859-1;pageencoding=iso-8859-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-8859-1s | U | 0 | 1 (0%) | 0 | 0 |
iso-8859-2" | | 0 | 0 | 0 | 1 (100%) |
iso-8859-2; | | 1 (100%) | 0 | 0 | 0 |
iso-8859-2> | | 0 | 1 (100%) | 0 | 0 |
iso-8859-6 | | 1 (100%) | 1 (100%) | 1 (100%) | 0 |
iso-8859-9" | | 0 | 0 | 0 | 1 (100%) |
iso-8859-l | U | 0 | 0 | 0 | 1 (0%) |
iso-8895-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-99-59-1 | U | 0 | 0 | 0 | 1 (0%) |
iso-9959-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-iso-8859-1 | U | 0 | 1 (0%) | 0 | 0 |
iso-utf-8 | U | 0 | 1 (0%) | 0 | 0 |
iso.8859-15 | | 1 (100%) | 0 | 0 | 0 |
iso8859-15 | | 0 | 1 (100%) | 0 | 0 |
iso8859-7 | | 0 | 1 (100%) | 0 | 0 |
iso88591 | | 0 | 1 (100%) | 0 | 0 |
iso_8859-15 | | 1 (100%) | 0 | 0 | 0 |
iso_8859-9 | | 0 | 1 (100%) | 0 | 0 |
iso_8859_1 | | 0 | 1 (100%) | 0 | 0 |
it_it.iso8859-15 | U | 0 | 1 (0%) | 0 | 0 |
koi8-u | | 1 (100%) | 1 (100%) | 1 (100%) | 0 |
ksc5601 | | 1 (100%) | 0 | 0 | 0 |
langtagcharsetiso | U | 0 | 1 (0%) | 0 | 0 |
latin | U | 1 (0%) | 0 | 0 | 0 |
latin2 | | 0 | 1 (100%) | 0 | 0 |
ms932 | | 1 (100%) | 0 | 0 | 0 |
nl_nl.iso8859-1 | U | 0 | 1 (0%) | 0 | 0 |
pl | U | 1 (0%) | 0 | 0 | 0 |
pt-iso-8859-1 | U | 0 | 1 (0%) | 0 | 0 |
shft_jis | U | 1 (0%) | 0 | 0 | 0 |
shift_jis; | | 0 | 0 | 0 | 1 (100%) |
sjis | | 1 (100%) | 0 | 0 | 0 |
unicode-1-1-utf-8 | U | 0 | 1 (0%) | 0 | 0 |
user-defined | U | 0 | 1 (0%) | 0 | 0 |
utf-8"" | | 0 | 0 | 0 | 1 (100%) |
utf8_czech_cs | U | 1 (0%) | 0 | 0 | 0 |
western | U | 0 | 1 (0%) | 0 | 0 |
white | U | 0 | 1 (0%) | 0 | 0 |
widows-1250 | U | 0 | 1 (0%) | 0 | 0 |
win1251 | U | 1 (0%) | 0 | 0 | 0 |
window-874 | U | 0 | 1 (0%) | 0 | 0 |
windows-1250; | | 0 | 1 (100%) | 0 | 0 |
windows-1251" | | 0 | 0 | 0 | 1 (100%) |
windows-1252' | | 0 | 1 (100%) | 0 | 0 |
windows-1252romance | U | 0 | 1 (0%) | 0 | 0 |
windows-1255; | | 0 | 1 (100%) | 0 | 0 |
windows-2252 | U | 0 | 1 (0%) | 0 | 0 |
windows-8859 | U | 0 | 1 (0%) | 0 | 0 |
windows-8859-1 | U | 0 | 1 (0%) | 0 | 0 |
windows-8859-2 | U | 0 | 1 (0%) | 0 | 0 |
windows-8859-2" | U | 0 | 1 (0%) | 0 | 0 |
windows-932 | | 1 (100%) | 0 | 0 | 0 |
x-mac-thai | U | 0 | 1 (0%) | 0 | 0 |
x-mac-turkish | | 0 | 0 | 0 | 1 (100%) |
x-x-big5 | U | 0 | 1 (0%) | 0 | 0 |
{charset} | U | 0 | 1 (0%) | 0 | 0 |