To check whether a string is correctly encoded in UTF-8, I suggest the following function, which follows RFC 3629 more closely than mb_check_encoding():
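<?php
function check_utf8($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c > 127) { // any byte >= 0x80 is non-ASCII and must start a multi-byte sequence
            // derive the sequence length from the lead byte
            if ($c > 247) return false;
            elseif ($c > 239) $bytes = 4;
            elseif ($c > 223) $bytes = 3;
            elseif ($c > 191) $bytes = 2;
            else return false; // stray continuation byte (0x80-0xBF)
            // the sequence must not run past the end of the string
            if (($i + $bytes) > $len) return false;
            // every remaining byte must be a continuation byte (0x80-0xBF)
            while ($bytes > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bytes--;
            }
        }
    }
    return true;
} // end of check_utf8
?>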
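The function above looks only at the lead-byte class and the continuation-byte range. As a rough sketch of how the remaining RFC 3629 restrictions (no overlong forms, no surrogates U+D800 to U+DFFF, nothing above U+10FFFF) might be added, the variant below narrows the allowed range of the second byte. The name check_utf8_strict is made up for this illustration, and the code is a sketch rather than a tested replacement for mb_check_encoding().

```php
<?php
// Hypothetical helper: byte-level check following the UTF-8 grammar in RFC 3629, section 4.
function check_utf8_strict($str) {
    $len = strlen($str);
    $i = 0;
    while ($i < $len) {
        $c = ord($str[$i]);
        if ($c < 0x80) { // plain ASCII
            $i++;
            continue;
        }
        // default range allowed for the second byte of a sequence
        $lo = 0x80;
        $hi = 0xBF;
        if ($c >= 0xC2 && $c <= 0xDF) {
            $n = 2; // 0xC0 and 0xC1 would only encode overlong forms
        } elseif ($c >= 0xE0 && $c <= 0xEF) {
            $n = 3;
            if ($c == 0xE0) $lo = 0xA0; // reject overlong 3-byte forms
            if ($c == 0xED) $hi = 0x9F; // reject surrogates U+D800..U+DFFF
        } elseif ($c >= 0xF0 && $c <= 0xF4) {
            $n = 4;
            if ($c == 0xF0) $lo = 0x90; // reject overlong 4-byte forms
            if ($c == 0xF4) $hi = 0x8F; // reject code points above U+10FFFF
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if ($i + $n > $len) return false; // truncated sequence
        $b = ord($str[$i + 1]);
        if ($b < $lo || $b > $hi) return false;
        for ($k = 2; $k < $n; $k++) {
            $b = ord($str[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) return false;
        }
        $i += $n;
    }
    return true;
}
```

For example, check_utf8_strict("\xC0\x80") and check_utf8_strict("\xED\xA0\x80") both return false, while the simpler lead-byte check accepts them.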
mb_check_encoding
(PHP 4 >= 4.4.3, PHP 5 >= 5.1.3)
mb_check_encoding — Check if the string is valid for the specified encoding
Description
bool mb_check_encoding ([ string $var = NULL [, string $encoding = mb_internal_encoding() ]] )
Checks if the specified byte stream is valid for the specified encoding. It is useful for preventing so-called "Invalid Encoding Attacks".
Parameters
var
    The byte stream to check. If it is omitted, this function checks all the input from the beginning of the request.
encoding
    The expected encoding.
Return Values
Returns TRUE on success or FALSE on failure.
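A minimal usage sketch, assuming the mbstring extension is loaded and a PHP version recent enough to reject invalid UTF-8 byte sequences; the sample strings are made up for illustration.

```php
<?php
// "caf\xC3\xA9" is the word "café" encoded as UTF-8;
// "caf\xE9" is the same word encoded as ISO-8859-1.
$utf8   = "caf\xC3\xA9";
$latin1 = "caf\xE9";

// Valid UTF-8 checked as UTF-8
var_dump(mb_check_encoding($utf8, 'UTF-8'));
// A lone 0xE9 byte is a truncated sequence when read as UTF-8
var_dump(mb_check_encoding($latin1, 'UTF-8'));
// The same bytes are acceptable when the expected encoding is ISO-8859-1
var_dump(mb_check_encoding($latin1, 'ISO-8859-1'));
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding() happens to be set to.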
javalc6 at gmail dot com
3 years ago
jbricci at ya-right dot com
4 years ago
This function does not check for bad byte sequence(s); it only checks whether the byte stream is valid. If you want to verify that an encoded string is valid (i.e. that it does not contain any bad byte sequences), do the following:
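<?php
/* check a string's encoded value by round-tripping it through a second encoding */
function checkEncoding ( $string, $string_encoding )
{
    /* intermediate encoding: UTF-8 and UTF-32 go through each other, anything else through itself */
    $fs = $string_encoding == 'UTF-8' ? 'UTF-32' : ( $string_encoding == 'UTF-32' ? 'UTF-8' : $string_encoding );
    /* source encoding: always the encoding the caller claims for $string */
    $ts = $string_encoding;
    /* convert $ts -> $fs -> $ts; a valid string survives the round trip unchanged */
    return $string === mb_convert_encoding ( mb_convert_encoding ( $string, $fs, $ts ), $ts, $fs );
}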
/* test 1 variables */
$string = "\x00\x81";
$encoding = "Shift_JIS";
/* test 1 mb_check_encoding (test for bad byte stream) */
if ( true === mb_check_encoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte stream!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte stream!<br />';
}
/* test 1 checkEncoding (test for bad byte sequence(s)) */
if ( true === checkEncoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte sequence!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte sequence!<br />';
}
/* test 2 variables */
$string = "\x00\xE3";
$encoding = "UTF-8";
/* test 2 mb_check_encoding (test for bad byte stream) */
if ( true === mb_check_encoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte stream!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte stream!<br />';
}
/* test 2 checkEncoding (test for bad byte sequence(s)) */
if ( true === checkEncoding ( $string, $encoding ) )
{
    echo 'valid (' . $encoding . ') encoded byte sequence!<br />';
}
else
{
    echo 'invalid (' . $encoding . ') encoded byte sequence!<br />';
}
?>
richard at phase dot org
11 months ago
The issue whereby mb_check_encoding($string, 'UTF-8') falsely returns true for invalid UTF-8 byte sequences was resolved somewhere between PHP 5.2.0 and 5.2.6.
The following equivalence seems to work in PHP 5.2.0 and 5.1.6:
$valid_utf8 = (@iconv('UTF-8','UTF-8',$string) === $string);
(with apologies for the @)
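If the @ is a concern, one alternative sketch relies on PCRE instead of iconv: with the u modifier, preg_match() returns false (and preg_last_error() reports PREG_BAD_UTF8_ERROR) when the subject is not valid UTF-8. The function name below is hypothetical, and the sketch assumes a PCRE build with UTF-8 support; behaviour on the oldest PHP 5.x builds is not verified here.

```php
<?php
// Hypothetical helper: validates UTF-8 by letting PCRE check the subject string.
function is_valid_utf8_pcre($string)
{
    // The empty pattern matches at offset 0 of any valid subject, so the result is 1.
    // For a subject that is not valid UTF-8, preg_match() returns false instead.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8_pcre("caf\xC3\xA9")); // well-formed UTF-8
var_dump(is_valid_utf8_pcre("caf\xE9"));     // truncated sequence
```

No error suppression is needed because a bad subject is reported through the return value rather than a notice.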
