The source data is 131,072 randomly selected HTTP URIs from the Open Directory Project, downloaded on 2008-02-26. (There is obvious bias in many directions, and less obvious bias in many more, so be careful in interpreting the data.)
In the table below, HTTP is the charset from the HTTP Content-Type header, using the "algorithm for extracting an encoding from a Content-Type" from HTML5 as of 2008-03-05.
meta content is from the first <meta ... content="...">.
meta charset is from the first <meta ... charset="...">.
Sniffer is the "encoding sniffing algorithm" run on the entire document.
Decoding was done with ICU4J 3.8.1. Parsing was done with the Validator.nu HTML parser 1.0.6.
Strings were lowercased before comparison.
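As a rough illustration of how an HTTP column value could be derived, here is a minimal Python sketch that pulls a charset label out of a Content-Type value and lowercases it. It is not the HTML5 algorithm used for the survey: the function name `charset_from_content_type` and the regular expression are assumptions made for this example, and the choice to stop an unquoted value at whitespace or a semicolon is also an assumption (labels such as `utf-8;` in the table suggest the algorithm actually used kept some trailing characters).

```python
import re
from typing import Optional

# Hypothetical, simplified pattern: "charset", "=", then a quoted or unquoted value.
# Assumption: an unquoted value ends at whitespace or ";".
_CHARSET_RE = re.compile(
    r'charset\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s;]+))',
    re.IGNORECASE,
)


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset label from a Content-Type value, or None."""
    match = _CHARSET_RE.search(value)
    if match is None:
        return None
    # Exactly one of the three alternatives matched; take whichever it was.
    label = next(group for group in match.groups() if group is not None)
    # Lowercase so that labels compare equal regardless of case.
    return label.lower()
```

For example, `charset_from_content_type("text/html; charset=UTF-8")` yields the lowercased label, and a value with no `charset` parameter yields `None`.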
The lists of pages show up to 100 URIs; if a list is too large, the URIs with the lowest MD5s are chosen.
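The lowest-MD5 rule gives a sample that is stable across runs and independent of input order. A minimal Python sketch follows; the function name `pick_uris` is hypothetical, and hashing the UTF-8 bytes of the URI string is an assumption, since the text does not say exactly what was hashed.

```python
import hashlib
from typing import Iterable, List


def pick_uris(uris: Iterable[str], limit: int = 100) -> List[str]:
    """Return all URIs if there are at most `limit`, else the `limit` with the lowest MD5."""
    uri_list = list(uris)
    if len(uri_list) <= limit:
        return uri_list

    def md5_hex(uri: str) -> str:
        # Assumption: the digest is taken over the URI string encoded as UTF-8.
        return hashlib.md5(uri.encode("utf-8")).hexdigest()

    # Sorting by hex digest orders URIs pseudo-randomly but reproducibly.
    return sorted(uri_list, key=md5_hex)[:limit]
```

Because the ordering depends only on each URI's digest, adding or removing unrelated URIs does not reshuffle the rest of the selection.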
Successfully downloaded text/html pages: 126989
Bytes used | Pages with charset sniffed |
---|---|
(all) | 89695 |
256 | 51513 (57%) |
512 | 73534 (82%) |
768 | 79101 (88%) |
1024 | 82692 (92%) |
1536 | 86192 (96%) |
2048 | 87714 (98%) |
4096 | 88971 (99%) |
Also see detailed graph.
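The percentages in the "Bytes used" table appear to be each row's count divided by the (all) row, rounded to a whole percent. A small Python check, using only the figures from the table (the names `TOTAL_SNIFFED`, `SNIFFED_BY_PREFIX` and `coverage_percent` are made up for this sketch):

```python
# Figures copied from the "Bytes used" table above.
TOTAL_SNIFFED = 89695  # the "(all)" row
SNIFFED_BY_PREFIX = {
    256: 51513,
    512: 73534,
    768: 79101,
    1024: 82692,
    1536: 86192,
    2048: 87714,
    4096: 88971,
}


def coverage_percent(count: int, total: int = TOTAL_SNIFFED) -> int:
    """Share of sniffable pages whose charset is found within the prefix, in whole percent."""
    return round(100 * count / total)


for prefix_bytes, count in SNIFFED_BY_PREFIX.items():
    print(f"{prefix_bytes:>5} bytes: {count} ({coverage_percent(count)}%)")
```

Read this way, the table says that scanning the first 1024 bytes finds the charset for about 92% of the pages where scanning the whole document finds one.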
Number of pages declaring encoding (% decoded without errors):

Charset | Flag | HTTP | meta content | Sniffer | meta charset |
---|---|---|---|---|---|
iso-8859-1 | | 10320 (100%) | 53318 (100%) | 53365 (100%) | 66 (100%) |
utf-8 | | 12253 (96%) | 11998 (96%) | 12006 (96%) | 8 (88%) |
windows-1252 | | 119 (100%) | 11606 (100%) | 11618 (100%) | 4 (100%) |
shift_jis | | 85 (100%) | 3587 (99%) | 3592 (99%) | 6 (100%) |
iso-8859-2 | | 269 (100%) | 1909 (100%) | 1914 (100%) | 9 (100%) |
windows-1251 | | 991 (100%) | 1606 (100%) | 1616 (100%) | 9 (100%) |
windows-1250 | | 48 (100%) | 958 (100%) | 962 (100%) | 4 (100%) |
gb2312 | | 68 (84%) | 755 (84%) | 759 (84%) | 2 (50%) |
iso-8859-15 | | 151 (100%) | 330 (100%) | 333 (100%) | 1 (100%) |
us-ascii | | 179 (99%) | 343 (95%) | 343 (95%) | 0 |
windows-1254 | | 12 (100%) | 355 (100%) | 354 (100%) | 0 |
big5 | | 47 (94%) | 343 (99%) | 344 (99%) | 1 (100%) |
iso-8859-9 | | 65 (100%) | 306 (100%) | 307 (100%) | 0 |
x-sjis | | 1 (100%) | 331 (97%) | 331 (97%) | 0 |
euc-jp | | 119 (86%) | 294 (90%) | 294 (90%) | 0 |
iso8859-1 | | 22 (100%) | 144 (100%) | 146 (100%) | 2 (100%) |
windows-1255 | | 21 (100%) | 152 (100%) | 156 (100%) | 4 (100%) |
| | U | 23 (0%) | 129 (0%) | 128 (1%) | 0 |
euc-kr | | 17 (94%) | 136 (98%) | 138 (98%) | 2 (100%) |
windows-1257 | | 14 (100%) | 134 (100%) | 135 (100%) | 0 |
windows-1256 | | 19 (100%) | 135 (100%) | 135 (100%) | 0 |
koi8-r | | 85 (100%) | 39 (100%) | 39 (100%) | 0 |
none | U | 92 (0%) | 0 | 0 | 0 |
iso-8859-7 | | 14 (100%) | 68 (100%) | 68 (100%) | 0 |
windows-1253 | | 3 (100%) | 76 (100%) | 76 (100%) | 0 |
windows-874 | | 4 (100%) | 58 (100%) | 58 (100%) | 0 |
x-windows-874 | U | 0 | 0 | 0 | 0 |
windows-1252; | | 54 (100%) | 0 | 0 | 0 |
utf-8; | | 33 (100%) | 12 (92%) | 12 (92%) | 0 |
iso-8559-1 | U | 2 (0%) | 30 (0%) | 30 (0%) | 0 |
tis-620 | | 13 (69%) | 22 (68%) | 22 (68%) | 0 |
iso-2022-jp | U | 3 (0%) | 28 (0%) | 28 (0%) | 0 |
iso-8859-8 | | 1 (100%) | 17 (94%) | 17 (94%) | 0 |
iso-8859-1" | | 0 | 0 | 24 (100%) | 26 (100%) |
iso-8859-1; | | 7 (100%) | 17 (100%) | 17 (100%) | 0 |
utf8 | | 10 (100%) | 12 (83%) | 12 (83%) | 0 |
gbk | | 10 (100%) | 15 (100%) | 15 (100%) | 0 |
unicode | | 0 | 18 (56%) | 18 (56%) | 0 |
cp1251 | | 15 (100%) | 1 (100%) | 1 (100%) | 0 |
latin1 | | 12 (100%) | 4 (100%) | 4 (100%) | 0 |
utf-16 | | 0 | 16 (50%) | 16 (50%) | 0 |
x-mac-roman | U | 0 | 15 (0%) | 15 (0%) | 0 |
shift-jis | | 2 (100%) | 11 (100%) | 12 (100%) | 1 (100%) |
cp1252 | | 12 (100%) | 0 | 0 | 0 |
en | U | 11 (0%) | 1 (0%) | 1 (0%) | 0 |
macintosh | | 0 | 12 (100%) | 12 (100%) | 0 |
iso-8859-8-i | | 3 (100%) | 10 (100%) | 10 (100%) | 0 |
x-euc-jp | | 0 | 10 (100%) | 10 (100%) | 0 |
ks_c_5601-1987 | | 0 | 9 (100%) | 9 (100%) | 0 |
.utf8 | | 8 (88%) | 0 | 0 | 0 |
iso | U | 0 | 8 (0%) | 8 (0%) | 0 |
cp-1251 | | 6 (100%) | 1 (100%) | 1 (100%) | 0 |
iso-8859-5 | | 0 | 7 (100%) | 7 (100%) | 0 |
0 | U | 6 (0%) | 0 | 0 | 0 |
bs_4730 | U | 6 (0%) | 0 | 0 | 0 |
iso- | U | 0 | 6 (0%) | 6 (0%) | 0 |
iso-8859-1, | | 1 (100%) | 6 (100%) | 6 (100%) | 0 |
_charset | U | 1 (0%) | 4 (0%) | 4 (0%) | 0 |
ascii | | 0 | 5 (80%) | 5 (80%) | 0 |
euc_kr | | 5 (100%) | 0 | 0 | 0 |
iso-8859-4 | | 1 (100%) | 5 (100%) | 5 (100%) | 0 |
utf-8" | | 1 (100%) | 1 (100%) | 3 (100%) | 3 (67%) |
visual | U | 0 | 3 (0%) | 3 (0%) | 2 (0%) |
windows-31j | | 4 (100%) | 1 (100%) | 1 (100%) | 0 |
charset=iso-8859-1 | U | 0 | 4 (0%) | 4 (0%) | 0 |
iso-8859 | U | 1 (0%) | 3 (0%) | 3 (0%) | 0 |
iso-8859-13 | | 0 | 3 (100%) | 3 (100%) | 0 |
iso8859-2 | | 2 (100%) | 2 (100%) | 2 (100%) | 0 |
iso8859_1 | | 3 (100%) | 2 (100%) | 2 (100%) | 0 |
iso_8859-1 | | 4 (100%) | 0 | 0 | 0 |
ucs-2 | | 0 | 4 (50%) | 4 (50%) | 0 |
x-user-defined | U | 0 | 4 (0%) | 4 (0%) | 1 (0%) |
charset=utf-8 | U | 0 | 3 (0%) | 3 (0%) | 0 |
euc_jp | | 2 (100%) | 1 (100%) | 1 (100%) | 0 |
iso-8859-3 | | 0 | 3 (67%) | 3 (67%) | 0 |
iso.8859-1 | | 3 (100%) | 0 | 0 | 0 |
iso8859-9 | | 1 (100%) | 2 (100%) | 2 (100%) | 0 |
latin-1 | | 3 (100%) | 1 (100%) | 1 (100%) | 0 |
win-1251 | U | 3 (0%) | 0 | 0 | 0 |
windows-1250" | | 1 (100%) | 0 | 2 (100%) | 2 (100%) |
windows-1252" | | 0 | 0 | 2 (100%) | 3 (100%) |
windows1254 | | 0 | 3 (100%) | 3 (100%) | 0 |
0ff | U | 2 (0%) | 0 | 0 | 0 |
8859-1 | | 0 | 2 (100%) | 2 (100%) | 0 |
8859_1 | | 2 (100%) | 0 | 0 | 0 |
<$mtpublishcharset$> | U | 0 | 2 (0%) | 2 (0%) | 0 |
ansi | U | 2 (0%) | 0 | 0 | 0 |
big5,euc-jp | U | 2 (0%) | 0 | 0 | 0 |
charset | U | 2 (0%) | 0 | 0 | 0 |
charset=windows-1251 | U | 2 (0%) | 0 | 0 | 0 |
en_us | U | 2 (0%) | 0 | 0 | 0 |
es | U | 0 | 2 (0%) | 2 (0%) | 0 |
euc | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 |
is0-8859-1 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 |
iso-10646 | U | 0 | 2 (0%) | 2 (0%) | 0 |
iso-8 | U | 0 | 2 (0%) | 2 (0%) | 0 |
iso-8559-2 | U | 0 | 2 (0%) | 2 (0%) | 0 |
iso-8829-2 | U | 0 | 2 (0%) | 2 (0%) | 0 |
iso-8859-1> | | 0 | 2 (100%) | 2 (100%) | 0 |
iso-88591 | | 0 | 2 (100%) | 2 (100%) | 0 |
no | U | 2 (0%) | 0 | 0 | 0 |
null | U | 0 | 2 (0%) | 2 (0%) | 0 |
uft-8 | U | 0 | 2 (0%) | 2 (0%) | 0 |
utf-32 | | 1 (0%) | 1 (0%) | 1 (0%) | 0 |
utf-9 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 |
win | U | 2 (0%) | 0 | 0 | 0 |
win-1250 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0 |
win-1252 | U | 1 (0%) | 0 | 1 (0%) | 1 (0%) |
windows | U | 2 (0%) | 0 | 0 | 0 |
%charset% | U | 0 | 1 (0%) | 1 (0%) | 0 |
, | U | 1 (0%) | 0 | 0 | 0 |
.iso-8859-1 | | 1 (100%) | 0 | 0 | 0 |
0,-1 | U | 1 (0%) | 0 | 0 | 0 |
10646 | U | 0 | 1 (0%) | 1 (0%) | 0 |
1250 | U | 1 (0%) | 0 | 0 | 0 |
1254 | U | 0 | 1 (0%) | 1 (0%) | 0 |
<% | U | 0 | 1 (0%) | 1 (0%) | 0 |
| | U | 0 | 1 (0%) | 1 (0%) | 0 |
_iso | U | 1 (0%) | 0 | 0 | 0 |
armscii-8 | U | 0 | 1 (0%) | 1 (0%) | 0 |
auto | U | 1 (0%) | 0 | 0 | 0 |
beagle kennel van der liniehoeve | U | 0 | 0 | 0 | 1 (0%) |
big-5 | | 1 (100%) | 0 | 0 | 0 |
big5; | | 0 | 1 (100%) | 1 (100%) | 0 |
cp1250 | | 0 | 1 (100%) | 1 (100%) | 0 |
cp1256 | | 0 | 1 (100%) | 1 (100%) | 0 |
cp852 | | 0 | 1 (100%) | 1 (100%) | 0 |
de_de@euro | U | 1 (0%) | 0 | 0 | 0 |
en-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
en-us | U | 0 | 1 (0%) | 1 (0%) | 0 |
enu-kr | U | 1 (0%) | 0 | 0 | 0 |
es_es.utf8 | U | 1 (0%) | 0 | 0 | 0 |
euc-2 | U | 0 | 1 (0%) | 1 (0%) | 0 |
euckr | | 1 (100%) | 0 | 0 | 0 |
gb-2312 | | 1 (100%) | 0 | 0 | 0 |
gb2312" | | 0 | 0 | 1 (0%) | 1 (0%) |
gb2312-80 | | 0 | 1 (0%) | 1 (0%) | 0 |
gb2312;charset=iso-8859-1 | U | 1 (0%) | 0 | 0 | 0 |
gb2312\" | | 0 | 0 | 1 (100%) | 1 (100%) |
greek | | 0 | 1 (100%) | 1 (100%) | 0 |
iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
ibm852 | | 0 | 0 | 0 | 0 |
ico-8859-1 | U | 1 (0%) | 0 | 0 | 0 |
is0-8859-2 | U | 1 (0%) | 0 | 0 | 0 |
iso-10646-ucs-2 | | 0 | 1 (0%) | 1 (0%) | 0 |
iso-10646-utf-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-1250 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-202059-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-2022 | U | 1 (0%) | 0 | 0 | 0 |
iso-2022-kr | U | 1 (0%) | 0 | 0 | 0 |
iso-5589-2 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-8759-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-88 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-8840-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-8850-15/ | U | 0 | 0 | 1 (0%) | 1 (0%) |
iso-8851-1 | U | 1 (0%) | 0 | 0 | 0 |
iso-8859- | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-8859-1"/ | | 0 | 0 | 0 | 1 (100%) |
iso-8859-13" | | 0 | 0 | 1 (100%) | 1 (100%) |
iso-8859-15; | | 1 (100%) | 0 | 0 | 0 |
iso-8859-1;pageencoding=iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-8859-1s | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-8859-2" | | 0 | 0 | 1 (100%) | 1 (100%) |
iso-8859-2; | | 1 (100%) | 0 | 0 | 0 |
iso-8859-2> | | 0 | 1 (100%) | 1 (100%) | 0 |
iso-8859-6 | | 1 (100%) | 1 (100%) | 1 (100%) | 0 |
iso-8859-9" | | 0 | 0 | 1 (100%) | 1 (100%) |
iso-8859-l | U | 0 | 0 | 1 (0%) | 1 (0%) |
iso-8895-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-99-59-1 | U | 0 | 0 | 1 (0%) | 1 (0%) |
iso-9959-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso-utf-8 | U | 0 | 1 (0%) | 1 (0%) | 0 |
iso.8859-15 | | 1 (100%) | 0 | 0 | 0 |
iso8859-15 | | 0 | 1 (100%) | 1 (100%) | 0 |
iso8859-7 | | 0 | 1 (100%) | 1 (100%) | 0 |
iso88591 | | 0 | 1 (100%) | 1 (100%) | 0 |
iso_8859-15 | | 1 (100%) | 0 | 0 | 0 |
iso_8859-9 | | 0 | 1 (100%) | 1 (100%) | 0 |
iso_8859_1 | | 0 | 1 (100%) | 1 (100%) | 0 |
it_it.iso8859-15 | U | 0 | 1 (0%) | 1 (0%) | 0 |
koi8-u | | 1 (100%) | 1 (100%) | 1 (100%) | 0 |
ksc5601 | | 1 (100%) | 0 | 0 | 0 |
langtagcharsetiso | U | 0 | 1 (0%) | 1 (0%) | 0 |
latin | U | 1 (0%) | 0 | 0 | 0 |
latin2 | | 0 | 1 (100%) | 1 (100%) | 0 |
ms932 | | 1 (100%) | 0 | 0 | 0 |
nl_nl.iso8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
pl | U | 1 (0%) | 0 | 0 | 0 |
pt-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
shft_jis | U | 1 (0%) | 0 | 0 | 0 |
shift_jis; | | 0 | 0 | 1 (100%) | 1 (100%) |
sjis | | 1 (100%) | 0 | 0 | 0 |
unicode-1-1-utf-8 | U | 0 | 1 (0%) | 1 (0%) | 0 |
user-defined | U | 0 | 1 (0%) | 1 (0%) | 0 |
utf-8"" | | 0 | 0 | 1 (100%) | 1 (100%) |
utf8_czech_cs | U | 1 (0%) | 0 | 0 | 0 |
western | U | 0 | 1 (0%) | 1 (0%) | 0 |
white | U | 0 | 1 (0%) | 1 (0%) | 0 |
widows-1250 | U | 0 | 1 (0%) | 1 (0%) | 0 |
win1251 | U | 1 (0%) | 0 | 0 | 0 |
window-874 | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-1250; | | 0 | 1 (100%) | 1 (100%) | 0 |
windows-1251" | | 0 | 0 | 1 (100%) | 1 (100%) |
windows-1252' | | 0 | 1 (100%) | 1 (100%) | 0 |
windows-1252romance | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-1255; | | 0 | 1 (100%) | 1 (100%) | 0 |
windows-2252 | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-8859 | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-8859-2 | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-8859-2" | U | 0 | 1 (0%) | 1 (0%) | 0 |
windows-932 | | 1 (100%) | 0 | 0 | 0 |
x-mac-thai | U | 0 | 1 (0%) | 1 (0%) | 0 |
x-mac-turkish | | 0 | 0 | 0 | 1 (100%) |
x-x-big5 | U | 0 | 1 (0%) | 1 (0%) | 0 |
{charset} | U | 0 | 1 (0%) | 1 (0%) | 0 |
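To make the "% decoded without errors" figures concrete, here is a minimal Python sketch of how a single page could be classified against its declared label. It uses Python's built-in codecs, not ICU4J as in the survey, so the set of recognized labels and the exact error behaviour will differ; the function name `classify` and the three result strings are assumptions made for this example.

```python
def classify(label: str, body: bytes) -> str:
    """Classify one page: 'ok', 'errors', or 'unsupported' for the declared label.

    Assumption: Python's codec registry stands in for the decoder used in the
    survey, so results for unusual labels will not match the table exactly.
    """
    try:
        # Strict mode raises on the first byte sequence that is invalid
        # in the declared encoding instead of substituting U+FFFD.
        body.decode(label, errors="strict")
    except LookupError:
        # The label does not name any text encoding known to this decoder.
        return "unsupported"
    except UnicodeError:
        # The label is known, but the bytes are not valid in that encoding.
        return "errors"
    return "ok"
```

A per-charset percentage like those in the table would then be the share of pages declaring that label for which the result is `"ok"`.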