We were seeing very peculiar behavior with accented characters such as e-acute (é).
The problem only showed up when those characters were read out of our MySQL database and displayed through one of our proxy servers, which handles DNS issues.
As other users have noted, the default character set wasn't what they expected when they left the setting blank.
Once we changed our default_charset to "UTF-8", we no longer needed functions like this one to handle accented characters such as e-acute. Good enough for us!
html_entity_decode
(PHP 4 >= 4.3.0, PHP 5)
html_entity_decode - Convert all HTML entities to their applicable characters
Description
html_entity_decode() is the opposite of htmlentities() in that it converts all HTML entities in the string to their applicable characters.
Parameters
- string
-
The input string.
- quote_style
-
The optional second parameter, quote_style, lets you define what will be done with 'single' and "double" quotes. It takes one of three constants, with ENT_COMPAT as the default:
Available quote_style constants
| Constant Name | Description |
|---|---|
| ENT_COMPAT | Will convert double quotes and leave single quotes alone. |
| ENT_QUOTES | Will convert both double and single quotes. |
| ENT_NOQUOTES | Will leave both double and single quotes unconverted. |
- charset
-
The ISO-8859-1 character set is used as the default for the optional third parameter, charset. It defines the character set used in the conversion.
The following character sets are supported in PHP 4.3.0 and later.
Supported character sets
| Charset | Aliases | Description |
|---|---|---|
| ISO-8859-1 | ISO8859-1 | Western European, Latin-1 |
| ISO-8859-15 | ISO8859-15 | Western European, Latin-9. Adds the Euro sign and the French and Finnish letters missing in Latin-1 (ISO-8859-1). |
| UTF-8 | | ASCII-compatible multi-byte 8-bit Unicode. |
| cp866 | ibm866, 866 | DOS-specific Russian character set. This character set is supported in 4.3.2. |
| cp1251 | Windows-1251, win-1251, 1251 | Windows-specific Russian character set. This character set is supported in 4.3.2. |
| cp1252 | Windows-1252, 1252 | Windows-specific character set for Western Europe. |
| KOI8-R | koi8-ru, koi8r | Russian. This character set is supported in 4.3.2. |
| BIG5 | 950 | Traditional Chinese, mainly used in Taiwan. |
| GB2312 | 936 | Simplified Chinese, national standard character set. |
| BIG5-HKSCS | | Big5 with Hong Kong extensions, Traditional Chinese. |
| Shift_JIS | SJIS, 932 | Japanese |
| EUC-JP | EUCJP | Japanese |
Note: Any other character set is not recognized, and ISO-8859-1 will be used instead.
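To make the three quote_style constants concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and the charset is passed explicitly so the result does not depend on the installation's default.

```php
<?php
// Sample input containing both kinds of encoded quotes (illustrative only).
$encoded = '&quot;double&quot; and &#039;single&#039;';

// Pass the charset explicitly so the result does not depend on php.ini.
$compat   = html_entity_decode($encoded, ENT_COMPAT, 'UTF-8');
$quotes   = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
$noquotes = html_entity_decode($encoded, ENT_NOQUOTES, 'UTF-8');

echo $compat, "\n";   // "double" and &#039;single&#039;
echo $quotes, "\n";   // "double" and 'single'
echo $noquotes, "\n"; // &quot;double&quot; and &#039;single&#039;
```

Only the quote entities are affected by this parameter; other entities are decoded the same way under all three constants.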
Return Values
Returns the decoded string.
Changelog
| Version | Description |
|---|---|
| 5.0.0 | Support for multi-byte character sets was added. |
Examples
Example #1 Decoding HTML entities
<?php
$orig = "I'll \"walk\" the <b>dog</b> now";
$a = htmlentities($orig);
$b = html_entity_decode($a);
echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now
echo $b; // I'll "walk" the <b>dog</b> now
// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
$c = unhtmlentities($a);
echo $c; // I'll "walk" the <b>dog</b> now
?>
Notes
Note:
You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string. That's because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim()) but character code 160 (0xa0) in the default ISO-8859-1 character set.
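The effect described in this note can be reproduced with a short sketch. It assumes UTF-8 output, where the no-break space is the two bytes 0xC2 0xA0; the regular expression is just one possible workaround, not the only one.

```php
<?php
// Decode with an explicit charset so the byte sequence is predictable.
$decoded = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

// trim() only strips ASCII whitespace by default, so the no-break space survives.
$trimmed = trim($decoded);
var_dump(bin2hex($trimmed)); // string(4) "c2a0"

// One possible workaround: also strip U+00A0 at both ends (needs the /u modifier).
$clean = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $decoded);
var_dump($clean); // string(0) ""
```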
See Also
- htmlentities() - Convert all applicable characters to HTML entities
- htmlspecialchars() - Convert special characters to HTML entities
- get_html_translation_table() - Returns the translation table used by htmlspecialchars and htmlentities
- urldecode() - Decodes a URL-encoded string
If you need something that converts &#[0-9]+ entities to UTF-8, this is simple and works:
<?php
/* Entity crap. */
$input = "Fovi&#269;";
$output = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $input);
/* Plain UTF-8. */
echo $output;
?>
BE AWARE: The documentation around the default charset might be wrong.
The changelog says:
5.3.3 Default charset changed from ISO-8859-1 to UTF-8.
Despite the fact that we are running 5.3.3-7 when we do
html_entity_decode("&nbsp;", ENT_QUOTES);
we get "\xa0", the ISO-8859-1 version of a non-breaking space.
When we change this to:
html_entity_decode("&nbsp;", ENT_QUOTES, 'UTF-8');
we properly get "\xc2\xa0"
Implying that 'UTF-8' is NOT the default for our installation of php.
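A quick way to check what an installation actually does is to compare the raw bytes. The sketch below passes both charsets explicitly, which also sidesteps the question of the default entirely; the variable names are illustrative.

```php
<?php
$entity = '&eacute;';

// Same entity, two target charsets: the returned bytes differ.
$latin1 = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');
$utf8   = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

echo bin2hex($latin1), "\n"; // e9
echo bin2hex($utf8), "\n";   // c3a9
```

Running bin2hex() on a call without the third argument and comparing it against these two values shows which default is in effect.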
This is a safe rawurldecode with utf8 detection:
<?php
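function utf8_rawurldecode($raw_url_encoded){
// Decode once and reuse the result.
$enc = rawurldecode($raw_url_encoded);
// Heuristic: UTF-8 text limited to Latin-1 characters survives this round trip unchanged.
if(utf8_encode(utf8_decode($enc))==$enc){
return $enc;
}else{
// Otherwise treat the bytes as ISO-8859-1 and convert them to UTF-8.
return utf8_encode($enc);
}
}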
?>
Handy function to convert remaining HTML-entities into human readable chars (for entities which do not exist in target charset):
<?php
function cleanString($in,$offset=0)
{
$out = trim($in);
if (!empty($out))
{
$entity_start = strpos($out,'&',$offset);
if ($entity_start === false)
{
// ideal
return $out;
}
else
{
$entity_end = strpos($out,';',$entity_start);
if ($entity_end === false)
{
return $out;
}
// too long to be an entity
else if ($entity_end > $entity_start+7)
{
// keep going
$out = cleanString($out,$entity_start+1);
}
// gotcha!
else
{
$clean = substr($out,0,$entity_start);
$subst = substr($out,$entity_start+1,1);
// &scaron; => "s" / &#353; => "_"
$clean .= ($subst != "#") ? $subst : "_";
$clean .= substr($out,$entity_end+1);
// keep going
$out = cleanString($clean,$entity_start+1);
}
}
}
return $out;
}
?>
I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'.
If you don't want a UTF-8 string, you'll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding().
If you're producing XML, which doesn't recognise most HTML entities:
When producing a UTF-8 document (the default), use htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8'). You only need to escape <, > and & unless you're printing inside the XML tags themselves.
Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document's DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).
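As a rough sketch of the UTF-8 approach described above, the two calls can be wrapped in a small helper. The function name html_to_xml_text() and the sample string are hypothetical, chosen only for this illustration.

```php
<?php
// Hypothetical helper name; not part of PHP.
function html_to_xml_text($html)
{
    // Step 1: turn every HTML entity into the real UTF-8 character.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    // Step 2: re-escape only &, < and >, which XML understands without a DTD.
    return htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');
}

$xmlText = html_to_xml_text('Caf&eacute; &amp; &lt;b&gt;bar&lt;/b&gt; &copy; 2010');
echo $xmlText, "\n"; // Café &amp; &lt;b&gt;bar&lt;/b&gt; © 2010
```

The accented letter and the copyright sign end up as literal UTF-8 characters, so the output is only suitable for documents that are really served as UTF-8.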
I just ran into the:
Bug #27626 html_entity_decode bug - cannot yet handle MBCS in html_entity_decode()!
The simple solution if you're still running PHP 4 is to wrap the html_entity_decode() call in utf8_encode().
<?php
$string = '&nbsp;';
$utf8_encode = utf8_encode(html_entity_decode($string));
?>
By default html_entity_decode() returns a string in the ISO-8859-1 character set, and utf8_encode() converts ISO-8859-1 to UTF-8. It is the inverse of utf8_decode(), which is documented as:
http://us.php.net/manual/en/function.utf8-decode.php
"Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"
I created this function to filter all the text that goes in or comes out of the database.
<?php
function filter_string($string, $nohtml='', $save='') {
if(!empty($nohtml)) {
$string = trim($string);
if(!empty($save)) $string = htmlentities(trim($string), ENT_QUOTES, 'ISO-8859-15');
else $string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-15');
}
if(!empty($save)) $string = mysql_real_escape_string($string);
else $string = stripslashes($string);
return($string);
}
?>
You may want to specify the character set if you see unexpected behavior. Here is an example.
# cat test.php
<?php
$str = '!';
$quotes = html_entity_decode($str, ENT_QUOTES);
$noquotes = html_entity_decode($str, ENT_NOQUOTES);
$noquotesutf8 = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
echo "quotes='$quotes', noquotes='$noquotes', noquotesutf8='$noquotesutf8'\n";
?>
# php test.php
quotes='!', noquotes='!', noquotesutf8='!'
The references to 'chr()' in the example unhtmlentities() function should be changed to unichr(), using the example unichr() function described in the 'chr' reference (http://php.net/chr).
The reason is that characters such as € (the Euro sign) have code points above 255, which chr() cannot represent.
I had a problem getting the 'TM' trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn't work, but directly replacing the entity with its single-byte Windows-1252 equivalent did:
$subject = str_replace('&trade;', chr(153), $subject);
This function decodes strings encoded by JavaScript's escape() function.
It is useful when the page uses multibyte characters.
JavaScript: escape('aaああaa') ..... 'aa%u3042%u3042aa'
PHP: jsEscape_decode('aa%u3042%u3042aa') ..... 'aaああaa'
<?php
function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){
$arrMojis = explode("%u",$jsEscaped);
for ($i = 1;$i < count($arrMojis);$i++){
$c = substr($arrMojis[$i],0,4);
$cc = mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16');
$arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4);
}
return implode('',$arrMojis);
}
?>
Here's a simple workaround for the UTF-8 support problem:
<?php
$var=iconv("UTF-8","ISO-8859-1",$var);
$var=html_entity_decode($var, ENT_QUOTES, 'ISO-8859-1');
$var=iconv("ISO-8859-1","UTF-8",$var);
?>
Here are the ultimate functions to convert HTML entities to UTF-8.
The main function is htmlentities2utf8; the others are helper functions.
<?php
function chr_utf8($code)
{
if ($code < 0) return false;
elseif ($code < 128) return chr($code);
elseif ($code < 160) // Remap illegal Windows characters
{
if ($code==128) $code=8364;
elseif ($code==129) $code=160; // not affected
elseif ($code==130) $code=8218;
elseif ($code==131) $code=402;
elseif ($code==132) $code=8222;
elseif ($code==133) $code=8230;
elseif ($code==134) $code=8224;
elseif ($code==135) $code=8225;
elseif ($code==136) $code=710;
elseif ($code==137) $code=8240;
elseif ($code==138) $code=352;
elseif ($code==139) $code=8249;
elseif ($code==140) $code=338;
elseif ($code==141) $code=160; // not affected
elseif ($code==142) $code=381;
elseif ($code==143) $code=160; // not affected
elseif ($code==144) $code=160; // not affected
elseif ($code==145) $code=8216;
elseif ($code==146) $code=8217;
elseif ($code==147) $code=8220;
elseif ($code==148) $code=8221;
elseif ($code==149) $code=8226;
elseif ($code==150) $code=8211;
elseif ($code==151) $code=8212;
elseif ($code==152) $code=732;
elseif ($code==153) $code=8482;
elseif ($code==154) $code=353;
elseif ($code==155) $code=8250;
elseif ($code==156) $code=339;
elseif ($code==157) $code=160; // not affected
elseif ($code==158) $code=382;
elseif ($code==159) $code=376;
}
if ($code < 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code & 63));
elseif ($code < 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
else return chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
}
// Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);
function html_entity_replace($matches)
{
if ($matches[2])
{
return chr_utf8(hexdec($matches[3]));
} elseif ($matches[1])
{
return chr_utf8($matches[3]);
}
switch ($matches[3])
{
case "nbsp": return chr_utf8(160);
case "iexcl": return chr_utf8(161);
case "cent": return chr_utf8(162);
case "pound": return chr_utf8(163);
case "curren": return chr_utf8(164);
case "yen": return chr_utf8(165);
//... etc with all named HTML entities
}
return false;
}
function htmlentities2utf8 ($string) // because of the html_entity_decode() bug with UTF-8
{
$string = preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $string);
return $string;
}
?>
If you want to decode NCRs to utf-8 use this function instead of chr().
<?php
function utf8_chr($code)
{
if($code<128) return chr($code);
else if($code<2048) return chr(($code>>6)+192).chr(($code&63)+128);
else if($code<65536) return chr(($code>>12)+224).chr((($code>>6)&63)+128).chr(($code&63)+128);
// Parenthesize each shift and mask: + binds tighter than >> and & in PHP.
else if($code<2097152) return chr(($code>>18)+240).chr((($code>>12)&63)+128)
.chr((($code>>6)&63)+128).chr(($code&63)+128);
}
?>
Note that
<?php
echo urlencode(html_entity_decode("&nbsp;"));
?>
will output "%A0" instead of "+".
[If you are missing the html_entity_decode() function in your version of PHP, you may wish to try this code snippet.]
<?php
if( !function_exists( 'html_entity_decode' ) )
{
function html_entity_decode( $given_html, $quote_style = ENT_QUOTES ) {
$trans_table = array_flip(get_html_translation_table( HTML_SPECIALCHARS, $quote_style ));
$trans_table['&#39;'] = "'";
return ( strtr( $given_html, $trans_table ) );
}
}
?>
To convert html entities into unicode characters, use the following:
<?php
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
foreach($trans_tbl as $k => $v)
{
$ttr[$v] = utf8_encode($k);
}
$text = strtr($text, $ttr);
?>
Quick & dirty code that translates numeric entities to UTF-8.
<?php
function replace_num_entity($ord)
{
$ord = $ord[1];
if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
{
$ord = hexdec($match[1]);
}
else
{
$ord = intval($ord);
}
$no_bytes = 0;
$byte = array();
if ($ord < 128)
{
return chr($ord);
}
elseif ($ord < 2048)
{
$no_bytes = 2;
}
elseif ($ord < 65536)
{
$no_bytes = 3;
}
elseif ($ord < 1114112)
{
$no_bytes = 4;
}
else
{
return;
}
switch($no_bytes)
{
case 2:
{
$prefix = array(31, 192);
break;
}
case 3:
{
$prefix = array(15, 224);
break;
}
case 4:
{
$prefix = array(7, 240);
}
}
for ($i = 0; $i < $no_bytes; $i++)
{
$byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
}
$byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];
$ret = '';
for ($i = 0; $i < $no_bytes; $i++)
{
$ret .= chr($byte[$i]);
}
return $ret;
}
$test = 'This is a &#269;&#x5d0; test&#39;';
echo $test . "<br />\n";
echo preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);
?>
Passing NULL or FALSE as a string will generate a '500 Internal Server Error' (or break the script when inside a function).
So always test your string first before passing it to html_entity_decode().
This function seems to have two limitations (at least in PHP 4.3.8):
a) it does not work with multibyte character codings, such as UTF-8
b) it does not decode numeric entity references
a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting to UTF-8 again. But that's quite ugly and destroys all characters not present in Latin-1.
b) can be solved rather nicely using the following code:
<?php
function decode_entities($text) {
$text= html_entity_decode($text,ENT_QUOTES,"ISO-8859-1"); #NOTE: UTF-8 does not work!
$text= preg_replace('/&#(\d+);/me',"chr(\\1)",$text); #decimal notation
$text= preg_replace('/&#x([a-f0-9]+);/mei',"chr(0x\\1)",$text); #hex notation
return $text;
}
?>
HTH
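The /e modifier used by decode_entities() above was removed in PHP 7.0, and chr() cannot represent code points above 255. A possible modern equivalent, sketched here with preg_replace_callback() and closures (PHP 5.3 or later), emits UTF-8 instead. The names codepoint_to_utf8() and decode_numeric_entities() are hypothetical, and the sketch does not validate surrogates or code points beyond U+10FFFF.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical replacement for the numeric part of decode_entities().
function decode_numeric_entities($text)
{
    // Decimal notation, e.g. &#8364;
    $text = preg_replace_callback('/&#(\d+);/', function ($m) {
        return codepoint_to_utf8((int) $m[1]);
    }, $text);
    // Hexadecimal notation, e.g. &#x20AC;
    return preg_replace_callback('/&#x([0-9a-f]+);/i', function ($m) {
        return codepoint_to_utf8(hexdec($m[1]));
    }, $text);
}

$decodedEntities = decode_numeric_entities('Price: 5&#8364; / 5&#x20AC; &#65;');
echo $decodedEntities, "\n"; // Price: 5€ / 5€ A
```

Named entities are left untouched by this sketch; they can still be handled by html_entity_decode() with an explicit 'UTF-8' charset on versions where that works.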
This functionality is now implemented in the PEAR package PHP_Compat.
More information about using this function without upgrading your version of PHP can be found on the below link:
http://pear.php.net/package/PHP_Compat
