The source data is 131,072 randomly selected HTTP URIs from the Open Directory Project, downloaded on 2008-02-26. (There is obvious bias in many directions, and less obvious bias in many more, so be careful when interpreting the data.)

In the table below, HTTP is the charset from the HTTP Content-Type header, extracted using the "algorithm for extracting an encoding from a Content-Type" from HTML5 as of 2008-03-05. meta content is the charset from the first <meta ... content="...">. meta charset is the charset from the first <meta ... charset="...">. Sniffer is the result of the "encoding sniffing algorithm" run on the entire document.
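
As a rough illustration of how a charset label can be pulled out of a Content-Type value, here is a minimal sketch. It is a simplified approximation written for illustration, not the HTML5 algorithm text itself; the function name extract_charset and the whitespace set are assumptions. The result is lowercased because the survey lowercases strings before comparison.

```python
import re

# Whitespace characters skipped around "=" (an assumption for this sketch).
WHITESPACE = "\t\n\x0c\r "


def extract_charset(content_type):
    """Return the lowercased charset label in a Content-Type value, or None."""
    match = re.search("charset", content_type, re.IGNORECASE)
    if match is None:
        return None
    pos = match.end()
    n = len(content_type)
    # Skip whitespace between "charset" and "=".
    while pos < n and content_type[pos] in WHITESPACE:
        pos += 1
    if pos >= n or content_type[pos] != "=":
        return None
    pos += 1
    # Skip whitespace between "=" and the value.
    while pos < n and content_type[pos] in WHITESPACE:
        pos += 1
    if pos >= n:
        return None
    # Quoted value: return the text up to the matching closing quote.
    if content_type[pos] in "\"'":
        closing = content_type.find(content_type[pos], pos + 1)
        if closing != -1:
            return content_type[pos + 1:closing].lower()
    # Unquoted value (or unmatched quote): read up to the next whitespace.
    end = pos
    while end < n and content_type[end] not in WHITESPACE:
        end += 1
    return content_type[pos:end].lower()
```

In this sketch an unquoted value only ends at whitespace, so extract_charset("text/html; charset=utf-8; format=flowed") returns "utf-8;" with the semicolon attached. That may be one way labels with trailing punctuation end up in the table, but this is a guess rather than something the data confirms.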

Decoding was done with ICU4J 3.8.1. Parsing was done with the Validator.nu HTML parser 1.0.6.

Strings were lowercased before comparison.

The lists of pages show up to 100 URIs each; if a list is too large, the URIs with the lowest MD5s are chosen.
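
Picking the lowest MD5s gives a sample that is stable between runs without being ordered by anything meaningful. A minimal sketch, assuming the MD5 is taken over the UTF-8 bytes of the URI string (the text does not say what exactly is hashed); the names md5_hex and pick_listed_uris are hypothetical:

```python
import hashlib
import heapq


def md5_hex(uri):
    """Hex MD5 of the URI string (assumed to be what gets hashed)."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


def pick_listed_uris(uris, limit=100):
    """Return all URIs if they fit, else the `limit` URIs with the lowest MD5."""
    uris = list(uris)
    if len(uris) <= limit:
        return uris
    return heapq.nsmallest(limit, uris, key=md5_hex)
```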

Successfully downloaded text/html pages: 126989

Encoding sniffing vs bytes examined

Bytes used | Pages with charset sniffed
(all) | 89695
256 | 51513 (57%)
512 | 73534 (82%)
768 | 79101 (88%)
1024 | 82692 (92%)
1536 | 86192 (96%)
2048 | 87714 (98%)
4096 | 88971 (99%)

Also see detailed graph.
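
To show what "bytes used" means in practice, the sketch below looks for a charset declaration in only the first N bytes of a document and counts how many documents are still covered. The regular expression is a crude stand-in for the HTML5 "encoding sniffing algorithm", not an implementation of it, and the names sniff_charset and sniffed_share are hypothetical.

```python
import re

# Crude stand-in for the real sniffing algorithm: first charset=... inside a <meta> tag.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([^\s\"'>]+)", re.IGNORECASE
)


def sniff_charset(document, limit=None):
    """Return the lowercased charset found in the first `limit` bytes, or None."""
    head = document if limit is None else document[:limit]
    match = META_CHARSET.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()


def sniffed_share(documents, limit):
    """Return (pages sniffed within `limit` bytes, pages sniffed with no limit)."""
    total = sum(1 for d in documents if sniff_charset(d) is not None)
    within = sum(1 for d in documents if sniff_charset(d, limit) is not None)
    return within, total
```

The percentages in the table are the first number divided by the second, for example 51513 / 89695, which is about 57% at 256 bytes.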

Encoding usage frequencies

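Number of pages declaring each encoding (% decoded without errors)
Charset | U | HTTP | meta content | Sniffer | meta charset
iso-8859-1 | | 10320 (100%) | 53318 (100%) | 53365 (100%) | 66 (100%)
utf-8 | | 12253 (96%) | 11998 (96%) | 12006 (96%) | 8 (88%)
windows-1252 | | 119 (100%) | 11606 (100%) | 11618 (100%) | 4 (100%)
shift_jis | | 85 (100%) | 3587 (99%) | 3592 (99%) | 6 (100%)
iso-8859-2 | | 269 (100%) | 1909 (100%) | 1914 (100%) | 9 (100%)
windows-1251 | | 991 (100%) | 1606 (100%) | 1616 (100%) | 9 (100%)
windows-1250 | | 48 (100%) | 958 (100%) | 962 (100%) | 4 (100%)
gb2312 | | 68 (84%) | 755 (84%) | 759 (84%) | 2 (50%)
iso-8859-15 | | 151 (100%) | 330 (100%) | 333 (100%) | 1 (100%)
us-ascii | | 179 (99%) | 343 (95%) | 343 (95%) | 0
windows-1254 | | 12 (100%) | 355 (100%) | 354 (100%) | 0
big5 | | 47 (94%) | 343 (99%) | 344 (99%) | 1 (100%)
iso-8859-9 | | 65 (100%) | 306 (100%) | 307 (100%) | 0
x-sjis | | 1 (100%) | 331 (97%) | 331 (97%) | 0
euc-jp | | 119 (86%) | 294 (90%) | 294 (90%) | 0
iso8859-1 | | 22 (100%) | 144 (100%) | 146 (100%) | 2 (100%)
windows-1255 | | 21 (100%) | 152 (100%) | 156 (100%) | 4 (100%)
(empty) | U | 23 (0%) | 129 (0%) | 128 (1%) | 0
euc-kr | | 17 (94%) | 136 (98%) | 138 (98%) | 2 (100%)
windows-1257 | | 14 (100%) | 134 (100%) | 135 (100%) | 0
windows-1256 | | 19 (100%) | 135 (100%) | 135 (100%) | 0
koi8-r | | 85 (100%) | 39 (100%) | 39 (100%) | 0
none | U | 92 (0%) | 0 | 0 | 0
iso-8859-7 | | 14 (100%) | 68 (100%) | 68 (100%) | 0
windows-1253 | | 3 (100%) | 76 (100%) | 76 (100%) | 0
windows-874 | | 4 (100%) | 58 (100%) | 58 (100%) | 0
x-windows-874 | U | 0 | 0 | 0 | 0
windows-1252; | | 54 (100%) | 0 | 0 | 0
utf-8; | | 33 (100%) | 12 (92%) | 12 (92%) | 0
iso-8559-1 | U | 2 (0%) | 30 (0%) | 30 (0%) | 0
tis-620 | | 13 (69%) | 22 (68%) | 22 (68%) | 0
iso-2022-jp | U | 3 (0%) | 28 (0%) | 28 (0%) | 0
iso-8859-8 | | 1 (100%) | 17 (94%) | 17 (94%) | 0
iso-8859-1" | | 0 | 0 | 24 (100%) | 26 (100%)
iso-8859-1; | | 7 (100%) | 17 (100%) | 17 (100%) | 0
utf8 | | 10 (100%) | 12 (83%) | 12 (83%) | 0
gbk | | 10 (100%) | 15 (100%) | 15 (100%) | 0
unicode | | 0 | 18 (56%) | 18 (56%) | 0
cp1251 | | 15 (100%) | 1 (100%) | 1 (100%) | 0
latin1 | | 12 (100%) | 4 (100%) | 4 (100%) | 0
utf-16 | | 0 | 16 (50%) | 16 (50%) | 0
x-mac-roman | U | 0 | 15 (0%) | 15 (0%) | 0
shift-jis | | 2 (100%) | 11 (100%) | 12 (100%) | 1 (100%)
cp1252 | | 12 (100%) | 0 | 0 | 0
en | U | 11 (0%) | 1 (0%) | 1 (0%) | 0
macintosh | | 0 | 12 (100%) | 12 (100%) | 0
iso-8859-8-i | | 3 (100%) | 10 (100%) | 10 (100%) | 0
x-euc-jp | | 0 | 10 (100%) | 10 (100%) | 0
ks_c_5601-1987 | | 0 | 9 (100%) | 9 (100%) | 0
.utf8 | | 8 (88%) | 0 | 0 | 0
iso | U | 0 | 8 (0%) | 8 (0%) | 0
cp-1251 | | 6 (100%) | 1 (100%) | 1 (100%) | 0
iso-8859-5 | | 0 | 7 (100%) | 7 (100%) | 0
0 | U | 6 (0%) | 0 | 0 | 0
bs_4730 | U | 6 (0%) | 0 | 0 | 0
iso- | U | 0 | 6 (0%) | 6 (0%) | 0
iso-8859-1, | | 1 (100%) | 6 (100%) | 6 (100%) | 0
_charset | U | 1 (0%) | 4 (0%) | 4 (0%) | 0
ascii | | 0 | 5 (80%) | 5 (80%) | 0
euc_kr | | 5 (100%) | 0 | 0 | 0
iso-8859-4 | | 1 (100%) | 5 (100%) | 5 (100%) | 0
utf-8" | | 1 (100%) | 1 (100%) | 3 (100%) | 3 (67%)
visual | U | 0 | 3 (0%) | 3 (0%) | 2 (0%)
windows-31j | | 4 (100%) | 1 (100%) | 1 (100%) | 0
charset=iso-8859-1 | U | 0 | 4 (0%) | 4 (0%) | 0
iso-8859 | U | 1 (0%) | 3 (0%) | 3 (0%) | 0
iso-8859-13 | | 0 | 3 (100%) | 3 (100%) | 0
iso8859-2 | | 2 (100%) | 2 (100%) | 2 (100%) | 0
iso8859_1 | | 3 (100%) | 2 (100%) | 2 (100%) | 0
iso_8859-1 | | 4 (100%) | 0 | 0 | 0
ucs-2 | | 0 | 4 (50%) | 4 (50%) | 0
x-user-defined | U | 0 | 4 (0%) | 4 (0%) | 1 (0%)
charset=utf-8 | U | 0 | 3 (0%) | 3 (0%) | 0
euc_jp | | 2 (100%) | 1 (100%) | 1 (100%) | 0
iso-8859-3 | | 0 | 3 (67%) | 3 (67%) | 0
iso.8859-1 | | 3 (100%) | 0 | 0 | 0
iso8859-9 | | 1 (100%) | 2 (100%) | 2 (100%) | 0
latin-1 | | 3 (100%) | 1 (100%) | 1 (100%) | 0
win-1251 | U | 3 (0%) | 0 | 0 | 0
windows-1250" | | 1 (100%) | 0 | 2 (100%) | 2 (100%)
windows-1252" | | 0 | 0 | 2 (100%) | 3 (100%)
windows1254 | | 0 | 3 (100%) | 3 (100%) | 0
0ff | U | 2 (0%) | 0 | 0 | 0
8859-1 | | 0 | 2 (100%) | 2 (100%) | 0
8859_1 | | 2 (100%) | 0 | 0 | 0
<$mtpublishcharset$> | U | 0 | 2 (0%) | 2 (0%) | 0
ansi | U | 2 (0%) | 0 | 0 | 0
big5,euc-jp | U | 2 (0%) | 0 | 0 | 0
charset | U | 2 (0%) | 0 | 0 | 0
charset=windows-1251 | U | 2 (0%) | 0 | 0 | 0
en_us | U | 2 (0%) | 0 | 0 | 0
es | U | 0 | 2 (0%) | 2 (0%) | 0
euc | U | 1 (0%) | 1 (0%) | 1 (0%) | 0
is0-8859-1 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0
iso-10646 | U | 0 | 2 (0%) | 2 (0%) | 0
iso-8 | U | 0 | 2 (0%) | 2 (0%) | 0
iso-8559-2 | U | 0 | 2 (0%) | 2 (0%) | 0
iso-8829-2 | U | 0 | 2 (0%) | 2 (0%) | 0
iso-8859-1> | | 0 | 2 (100%) | 2 (100%) | 0
iso-88591 | | 0 | 2 (100%) | 2 (100%) | 0
no | U | 2 (0%) | 0 | 0 | 0
null | U | 0 | 2 (0%) | 2 (0%) | 0
uft-8 | U | 0 | 2 (0%) | 2 (0%) | 0
utf-32 | | 1 (0%) | 1 (0%) | 1 (0%) | 0
utf-9 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0
win | U | 2 (0%) | 0 | 0 | 0
win-1250 | U | 1 (0%) | 1 (0%) | 1 (0%) | 0
win-1252 | U | 1 (0%) | 0 | 1 (0%) | 1 (0%)
windows | U | 2 (0%) | 0 | 0 | 0
%charset% | U | 0 | 1 (0%) | 1 (0%) | 0
, | U | 1 (0%) | 0 | 0 | 0
.iso-8859-1 | | 1 (100%) | 0 | 0 | 0
0,-1 | U | 1 (0%) | 0 | 0 | 0
10646 | U | 0 | 1 (0%) | 1 (0%) | 0
1250 | U | 1 (0%) | 0 | 0 | 0
1254 | U | 0 | 1 (0%) | 1 (0%) | 0
<% | U | 0 | 1 (0%) | 1 (0%) | 0
<?php | U | 0 | 1 (0%) | 1 (0%) | 0
_iso | U | 1 (0%) | 0 | 0 | 0
armscii-8 | U | 0 | 1 (0%) | 1 (0%) | 0
auto | U | 1 (0%) | 0 | 0 | 0
beagle kennel van der liniehoeve | U | 0 | 0 | 0 | 1 (0%)
big-5 | | 1 (100%) | 0 | 0 | 0
big5; | | 0 | 1 (100%) | 1 (100%) | 0
cp1250 | | 0 | 1 (100%) | 1 (100%) | 0
cp1256 | | 0 | 1 (100%) | 1 (100%) | 0
cp852 | | 0 | 1 (100%) | 1 (100%) | 0
de_de@euro | U | 1 (0%) | 0 | 0 | 0
en-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
en-us | U | 0 | 1 (0%) | 1 (0%) | 0
enu-kr | U | 1 (0%) | 0 | 0 | 0
es_es.utf8 | U | 1 (0%) | 0 | 0 | 0
euc-2 | U | 0 | 1 (0%) | 1 (0%) | 0
euckr | | 1 (100%) | 0 | 0 | 0
gb-2312 | | 1 (100%) | 0 | 0 | 0
gb2312" | | 0 | 0 | 1 (0%) | 1 (0%)
gb2312-80 | | 0 | 1 (0%) | 1 (0%) | 0
gb2312;charset=iso-8859-1 | U | 1 (0%) | 0 | 0 | 0
gb2312\" | | 0 | 0 | 1 (100%) | 1 (100%)
greek | | 0 | 1 (100%) | 1 (100%) | 0
iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
ibm852 | | 0 | 0 | 0 | 0
ico-8859-1 | U | 1 (0%) | 0 | 0 | 0
is0-8859-2 | U | 1 (0%) | 0 | 0 | 0
iso-10646-ucs-2 | | 0 | 1 (0%) | 1 (0%) | 0
iso-10646-utf-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-1250 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-202059-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-2022 | U | 1 (0%) | 0 | 0 | 0
iso-2022-kr | U | 1 (0%) | 0 | 0 | 0
iso-5589-2 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-8759-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-88 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-8840-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-8850-15/ | U | 0 | 0 | 1 (0%) | 1 (0%)
iso-8851-1 | U | 1 (0%) | 0 | 0 | 0
iso-8859- | U | 0 | 1 (0%) | 1 (0%) | 0
iso-8859-1"/ | | 0 | 0 | 0 | 1 (100%)
iso-8859-13" | | 0 | 0 | 1 (100%) | 1 (100%)
iso-8859-15; | | 1 (100%) | 0 | 0 | 0
iso-8859-1;pageencoding=iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-8859-1s | U | 0 | 1 (0%) | 1 (0%) | 0
iso-8859-2" | | 0 | 0 | 1 (100%) | 1 (100%)
iso-8859-2; | | 1 (100%) | 0 | 0 | 0
iso-8859-2> | | 0 | 1 (100%) | 1 (100%) | 0
iso-8859-6 | | 1 (100%) | 1 (100%) | 1 (100%) | 0
iso-8859-9" | | 0 | 0 | 1 (100%) | 1 (100%)
iso-8859-l | U | 0 | 0 | 1 (0%) | 1 (0%)
iso-8895-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-99-59-1 | U | 0 | 0 | 1 (0%) | 1 (0%)
iso-9959-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
iso-utf-8 | U | 0 | 1 (0%) | 1 (0%) | 0
iso.8859-15 | | 1 (100%) | 0 | 0 | 0
iso8859-15 | | 0 | 1 (100%) | 1 (100%) | 0
iso8859-7 | | 0 | 1 (100%) | 1 (100%) | 0
iso88591 | | 0 | 1 (100%) | 1 (100%) | 0
iso_8859-15 | | 1 (100%) | 0 | 0 | 0
iso_8859-9 | | 0 | 1 (100%) | 1 (100%) | 0
iso_8859_1 | | 0 | 1 (100%) | 1 (100%) | 0
it_it.iso8859-15 | U | 0 | 1 (0%) | 1 (0%) | 0
koi8-u | | 1 (100%) | 1 (100%) | 1 (100%) | 0
ksc5601 | | 1 (100%) | 0 | 0 | 0
langtagcharsetiso | U | 0 | 1 (0%) | 1 (0%) | 0
latin | U | 1 (0%) | 0 | 0 | 0
latin2 | | 0 | 1 (100%) | 1 (100%) | 0
ms932 | | 1 (100%) | 0 | 0 | 0
nl_nl.iso8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
pl | U | 1 (0%) | 0 | 0 | 0
pt-iso-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
shft_jis | U | 1 (0%) | 0 | 0 | 0
shift_jis; | | 0 | 0 | 1 (100%) | 1 (100%)
sjis | | 1 (100%) | 0 | 0 | 0
unicode-1-1-utf-8 | U | 0 | 1 (0%) | 1 (0%) | 0
user-defined | U | 0 | 1 (0%) | 1 (0%) | 0
utf-8"" | | 0 | 0 | 1 (100%) | 1 (100%)
utf8_czech_cs | U | 1 (0%) | 0 | 0 | 0
western | U | 0 | 1 (0%) | 1 (0%) | 0
white | U | 0 | 1 (0%) | 1 (0%) | 0
widows-1250 | U | 0 | 1 (0%) | 1 (0%) | 0
win1251 | U | 1 (0%) | 0 | 0 | 0
window-874 | U | 0 | 1 (0%) | 1 (0%) | 0
windows-1250; | | 0 | 1 (100%) | 1 (100%) | 0
windows-1251" | | 0 | 0 | 1 (100%) | 1 (100%)
windows-1252' | | 0 | 1 (100%) | 1 (100%) | 0
windows-1252romance | U | 0 | 1 (0%) | 1 (0%) | 0
windows-1255; | | 0 | 1 (100%) | 1 (100%) | 0
windows-2252 | U | 0 | 1 (0%) | 1 (0%) | 0
windows-8859 | U | 0 | 1 (0%) | 1 (0%) | 0
windows-8859-1 | U | 0 | 1 (0%) | 1 (0%) | 0
windows-8859-2 | U | 0 | 1 (0%) | 1 (0%) | 0
windows-8859-2" | U | 0 | 1 (0%) | 1 (0%) | 0
windows-932 | | 1 (100%) | 0 | 0 | 0
x-mac-thai | U | 0 | 1 (0%) | 1 (0%) | 0
x-mac-turkish | | 0 | 0 | 0 | 1 (100%)
x-x-big5 | U | 0 | 1 (0%) | 1 (0%) | 0
{charset} | U | 0 | 1 (0%) | 1 (0%) | 0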
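
The "% decoded without errors" figures come from attempting a strict decode of each page with its declared encoding. The survey used ICU4J for this; the sketch below swaps in Python's built-in codecs purely for illustration, so its alias table and error behaviour will differ from ICU4J and it would not reproduce the numbers above. The function name decode_status and its three return values are hypothetical.

```python
def decode_status(data, label):
    """Classify a strict decode attempt as "ok", "errors" or "unsupported"."""
    try:
        data.decode(label, errors="strict")
    except LookupError:
        # Python has no codec registered under this label.
        return "unsupported"
    except UnicodeDecodeError:
        # The codec exists, but the bytes are not valid in it.
        return "errors"
    return "ok"
```

For example, the bytes b"caf\xe9" decode cleanly as iso-8859-1 but not as utf-8, and a misspelled label such as uft-8 is not recognized at all.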

List of pages per encoding

(empty)
Invalid:
%charset%
Invalid:
,
Invalid:
.iso-8859-1
Valid:
.utf8
Valid: Invalid:
0
Invalid:
0,-1
Invalid:
0ff
Invalid:
10646
Invalid:
1250
Invalid:
1254
Invalid:
8859-1
Valid:
8859_1
Valid:
<$mtpublishcharset$>
Invalid:
<%
Invalid:
<?php
Invalid:
_charset
Invalid:
_iso
Invalid:
ansi
Invalid:
armscii-8
Invalid:
ascii
Valid: Invalid:
auto
Invalid:
beagle kennel van der liniehoeve
Invalid:
big-5
Valid:
big5
Valid: Invalid:
big5,euc-jp
Invalid:
big5;
Valid:
bs_4730
Invalid:
charset
Invalid:
charset=iso-8859-1
Invalid:
charset=utf-8
Invalid:
charset=windows-1251
Invalid:
cp-1251
Valid:
cp1250
Valid:
cp1251
Valid:
cp1252
Valid:
cp1256
Valid:
cp852
Valid:
de_de@euro
Invalid:
en
Invalid:
en-iso-8859-1
Invalid:
en-us
Invalid:
en_us
Invalid:
enu-kr
Invalid:
es
Invalid:
es_es.utf8
Invalid:
euc
Invalid:
euc-2
Invalid:
euc-jp
Valid: Invalid:
euc-kr
Valid: Invalid:
euc_jp
Valid:
euc_kr
Valid:
euckr
Valid:
gb-2312
Valid:
gb2312
Valid: Invalid: