mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding - Detect character encoding

Description

mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Detects the most likely character encoding for the string string from an ordered list of candidates.

Automatic detection of the intended character encoding can never be entirely reliable; without some additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of character encoding stored or transmitted with the data, such as a "Content-Type" HTTP header.

This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding will be rejected, and the next encoding will be checked.
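
As a sketch of the advice above about preferring a declared encoding: the two helpers below (both names are hypothetical and not part of mbstring) pull the charset parameter out of a Content-Type value and accept it only if the bytes are actually valid in it. The regular expression is a simplification, not a full HTTP header parser, and the name check against mb_list_encodings() ignores aliases.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type value.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: trust the declared charset only if mbstring knows it
// and the bytes form a valid string in it.
function declared_encoding_if_valid(string $bytes, string $contentType): ?string
{
    $charset = charset_from_content_type($contentType);
    if ($charset === null) {
        return null;
    }
    // Compare case-insensitively against the canonical names mbstring reports.
    $known = array_map('strtolower', mb_list_encodings());
    if (!in_array(strtolower($charset), $known, true)) {
        return null;
    }
    return mb_check_encoding($bytes, $charset) ? $charset : null;
}
```

Only when such a helper returns null would it make sense to fall back to mb_detect_encoding().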

Parameters

string

The string being inspected.

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or as a single string of encodings separated by commas.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option or the mb_detect_order() function) will be used.

strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.

The default value for strict can be set with the mbstring.strict_detection configuration option.
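
A minimal sketch of how the encodings and strict parameters interact with mb_detect_order(). The variable names are illustrative only, and the candidate list is an arbitrary choice.

```php
<?php
// Remember the current detection order so it can be restored afterwards.
$previousOrder = mb_detect_order();

// This order is used whenever $encodings is omitted or null.
mb_detect_order(['ASCII', 'UTF-8']);
$currentOrder = mb_detect_order();

// "\xFF" is valid in neither ASCII nor UTF-8, so strict detection returns false.
$strictResult = mb_detect_encoding("\xFF", null, true);

// Passing the same list explicitly gives the same strict result.
$explicitResult = mb_detect_encoding("\xFF", ['ASCII', 'UTF-8'], true);

// Restore the original order.
mb_detect_order($previousOrder);
```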

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.

Examples

Example #1 mb_detect_encoding() example

<?php
// $str is assumed to hold the bytes to inspect

// Detect character encoding with the current detect_order
echo mb_detect_encoding($str);

// "auto" is expanded according to mbstring.language
echo mb_detect_encoding($str, "auto");

// Specify the "encodings" parameter as a comma-separated string
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

// Use an array to specify the "encodings" parameter
$encodings = [
    "ASCII",
    "JIS",
    "EUC-JP"
];
echo mb_detect_encoding($str, $encodings);
?>

Example #2 Effect of the strict parameter

<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

In some cases, the same sequence of bytes may form a valid string in multiple character encodings, and it is impossible to know which interpretation was intended. For instance, among many others, the byte sequence "\xC4\xA2" could be:

  • "Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in ISO-8859-1, ISO-8859-15, or Windows-1252
  • "ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
  • "Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8
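
The ambiguity listed above can be reproduced with mb_convert_encoding(): the sketch below decodes the same two bytes under each candidate encoding and re-encodes the result as UTF-8. The variable names are illustrative.

```php
<?php
$bytes = "\xC4\xA2";

// Interpret the same bytes under three encodings, converting each result to UTF-8.
$asLatin1   = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'); // U+00C4 U+00A2
$asCyrillic = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-5'); // U+0424 U+0402
$asUtf8     = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');      // U+0122

// Show the UTF-8 bytes of each interpretation as hex.
foreach (['ISO-8859-1' => $asLatin1, 'ISO-8859-5' => $asCyrillic, 'UTF-8' => $asUtf8] as $name => $text) {
    echo $name, ': ', bin2hex($text), "\n";
}
```

All three conversions succeed, which is why the bytes alone cannot tell you which text was meant.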

Example #3 Using order when multiple encodings match

<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, so the first one listed will be returned
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

See Also

  • mb_detect_order() - Set/get the list of encodings used for character encoding detection


User Contributed Notes 26 notes

up
85
Gerg Tisza
12 years ago
If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; it is pretty worthless otherwise.

<?php
$str = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false
?>
up
10
mta59066 at gmail dot com
1 year ago
The documentation is no longer correct for php8.1 and mb_detect_encoding no longer supports order of encodings. The example outputs given in the documentation are also no longer correct for php8.1. This is somewhat explained here https://github.com/php/php-src/issues/8279

I understand the previous ambiguity in these functions, but in my opinion 8.1 should have deprecated mb_detect_encoding and mb_detect_order and come up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.

Below is an example function that will do what mb_detect_encoding was doing prior to the 8.1 change.

<?php

function mb_detect_enconding_in_order(string $string, array $encodings): string|false
{
    // Return the first encoding, in list order, in which the string is valid.
    foreach ($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}

?>
up
5
geompse at gmail dot com
10 months ago
Major undocumented breaking change since 8.1.7
https://3v4l.org/BLjZ3

Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding
up
22
Chrigu
18 years ago
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
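
A short sketch of why the order matters here, using mb_check_encoding(): every byte value is a valid character in ISO-8859-1, so that encoding never rejects anything, while UTF-8 does. The variable names are illustrative.

```php
<?php
$latin1Bytes = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1
$utf8Bytes   = "\xC3\xA1";         // 'á' encoded in UTF-8

// ISO-8859-1 accepts both byte strings.
$latin1AsLatin1 = mb_check_encoding($latin1Bytes, 'ISO-8859-1');
$utf8AsLatin1   = mb_check_encoding($utf8Bytes, 'ISO-8859-1');

// UTF-8 accepts only the byte string that is well-formed UTF-8.
$latin1AsUtf8 = mb_check_encoding($latin1Bytes, 'UTF-8');
$utf8AsUtf8   = mb_check_encoding($utf8Bytes, 'UTF-8');
```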
up
12
dennis at nikolaenko dot ru
14 years ago
Beware of a bug when detecting Russian encodings:
http://bugs.php.net/bug.php?id=38138
up
17
chris AT w3style.co DOT uk
17 years ago
Based on the preg_match() snippet below, I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
        )+%xs', $string);
}

?>
up
5
rl at itfigures dot nl
16 years ago
I used Chris's function "detectUTF8" to detect the need for conversion from utf8 to 8859-1, which works fine. I did have a problem with the following iconv conversion.

The problem is that the iconv-conversion to 8859-1 (with //TRANSLIT) replaces the euro-sign with EUR, although it is common practice that \x80 is used as the euro-sign in the 8859-1 charset.

I could not use 8859-15 since that mangled some other characters, so I added 2 str_replace's:

if (detectUTF8($str)) {
    $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
    $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
    $str = str_replace("&euro;", "\x80", $str);
}

If html-output is needed the last line is not necessary (and even unwanted).
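
For comparison, a hedged sketch using mbstring instead of the str_replace workaround. Strictly speaking, byte \x80 is the euro sign in Windows-1252, which many applications treat as ISO-8859-1; targeting Windows-1252 directly therefore yields \x80 without any placeholder. The variable names are illustrative, and this assumes the consumer of the output really does interpret the bytes as Windows-1252.

```php
<?php
$euroUtf8 = "\xE2\x82\xAC"; // U+20AC EURO SIGN encoded in UTF-8

// Windows-1252 has a euro sign at byte 0x80.
$euroCp1252 = mb_convert_encoding($euroUtf8, 'Windows-1252', 'UTF-8');

// The conversion round-trips back to the original UTF-8 bytes.
$roundTrip = mb_convert_encoding($euroCp1252, 'UTF-8', 'Windows-1252');

// ISO-8859-1 itself has no euro sign, so mbstring emits its substitute
// character here rather than "\x80".
$euroLatin1 = mb_convert_encoding($euroUtf8, 'ISO-8859-1', 'UTF-8');
```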
up
15
nat3738 at gmail dot com
14 years ago
A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (this does not work for strings or files without a BOM):

<?php
// The Unicode BOM is U+FEFF; once encoded, it appears as one of these byte sequences.
define('UTF32_BIG_ENDIAN_BOM', chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define('UTF16_BIG_ENDIAN_BOM', chr(0xFE) . chr(0xFF));
define('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define('UTF8_BOM', chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {
    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // four bytes, to match the UTF-32 BOMs

    // UTF-32 is checked before UTF-16 because the UTF-32LE BOM
    // begins with the same two bytes as the UTF-16LE BOM.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';

    return null; // no BOM found
}
?>
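
The same idea can be applied to a string that is already in memory. The function below is a hypothetical variant (detect_bom is not a built-in) that walks a table of BOMs ordered so that longer prefixes are tested before shorter ones that they contain.

```php
<?php
// Hypothetical helper: return the encoding implied by a leading BOM, or null.
function detect_bom(string $bytes): ?string
{
    // UTF-32LE must precede UTF-16LE: its BOM starts with "\xFF\xFE" too.
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        // Binary-safe prefix comparison.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

A UTF-16LE string whose first character is U+0000 is indistinguishable from UTF-32LE by BOM alone, so the result is still only a hint.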
up
6
eyecatchup at gmail dot com
10 years ago
Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
if (preg_match("//u", $string)) {
    // $string is valid UTF-8
}
?>
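
A small sketch that wraps this check in a function returning a strict boolean; is_valid_utf8 is a hypothetical name. It relies on preg_match() returning false, rather than 0, when the subject is not valid UTF-8 under the 'u' modifier.

```php
<?php
// Hypothetical helper: true only if $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    // The empty pattern matches any valid subject (result 1);
    // an invalid UTF-8 subject makes preg_match() return false.
    return preg_match('//u', $bytes) === 1;
}

$ok  = is_valid_utf8("\xC4\xA2"); // well-formed two-byte sequence
$bad = is_valid_utf8("\xE1\xE9"); // ISO-8859-1 bytes, not well-formed UTF-8
```

Where mbstring is available, mb_check_encoding($bytes, 'UTF-8') expresses the same test more directly.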
up
6
hmdker at gmail dot com
15 years ago
A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.

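<?php
function is_utf8($str) {
    $c = 0; $b = 0;
    $bits = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c >= 128) { // any non-ASCII byte, including 0x80 itself
            // Work out the sequence length from the lead byte.
            if ($c >= 254) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false; // continuation byte without a lead byte
            if (($i + $bits) > $len) return false; // truncated sequence
            // Every remaining byte must be a continuation byte (0x80-0xBF).
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>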
up
1
recentUser at example dot com
5 years ago
In my environment (PHP 7.1.12),
"mb_detect_encoding()" doesn't work
where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case,
simply put "mb_detect_order('...')"
before "mb_detect_encoding()" in your script file.

Both
"ini_set('mbstring.language', '...');"
and
"ini_set('mbstring.detect_order', '...');"
DON'T work in script files for this purpose
whereas setting them in PHP.INI file may work.
up
2
garbage at iglou dot eu
6 years ago
To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori
up
2
php-note-2005 at ryandesign dot com
18 years ago
Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$%xs', $string) === 1;

} // function is_utf8

?>
up
2
emoebel at web dot de
9 years ago
if the function " mb_detect_encoding" does not exist ...

... try:

<?php
// ----------------------------------------------------
if ( !function_exists('mb_detect_encoding') ) {

    // ----------------------------------------------------------------
    function mb_detect_encoding ($string, $enc=null, $ret=null) {

        static $enclist = array(
            'UTF-8', 'ASCII',
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5',
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10',
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
            'Windows-1251', 'Windows-1252', 'Windows-1254',
        );

        $result = false;

        // The first encoding that round-trips the string unchanged through iconv wins.
        foreach ($enclist as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($ret === NULL) { $result = $item; } else { $result = true; }
                break;
            }
        }

        return $result;
    }
    // ----------------------------------------------------------------

}
// ----------------------------------------------------
?>

example / usage of: mb_detect_encoding()

<?php
// ------------------------------------------------------
function str_to_utf8 ($str) {

    if (mb_detect_encoding($str, 'UTF-8', true) === false) {
        $str = utf8_encode($str);
    }

    return $str;
}
// ------------------------------------------------------

$txtstr = str_to_utf8($txtstr);
?>
up
1
bmrkbyet at web dot de
10 years ago
a) if the FUNCTION mb_detect_encoding is not available:

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if (!function_exists('mb_detect_encoding')) {
    function mb_detect_encoding($string, $enc=null) {

        static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($enc == $item) { return true; } else { return $item; }
            }
        }
        return null;
    }
}

// -------------------------------------------
?>

b) if the FUNCTION mb_convert_encoding is not available:

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($string, $target_encoding, $source_encoding) {
        $string = iconv($source_encoding, $target_encoding, $string);
        return $string;
    }
}

// -------------------------------------------
?>
up
1
maarten
18 years ago
Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some iso-8859-1 encoded text as utf-8.
To verify UTF-8, use the following:

//
// utf8 encoding validation developed based on Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C-implementation to be included in PHP core
//
function valid_1byte($char) {
if(!is_int($char)) return false;
return ($char & 0x80) == 0x00;
}

function valid_2byte($char) {
if(!is_int($char)) return false;
return ($char & 0xE0) == 0xC0;
}

function valid_3byte($char) {
if(!is_int($char)) return false;
return ($char & 0xF0) == 0xE0;
}

function valid_4byte($char) {
if(!is_int($char)) return false;
return ($char & 0xF8) == 0xF0;
}

function valid_nextbyte($char) {
if(!is_int($char)) return false;
return ($char & 0xC0) == 0x80;
}

function valid_utf8($string) {
$len = strlen($string);
$i = 0;
while( $i < $len ) {
$char = ord(substr($string, $i++, 1));
if(valid_1byte($char)) { // continue
continue;
} else if(valid_2byte($char)) { // check 1 byte
if(!valid_nextbyte(ord(substr($string, $i++, 1))))
return false;
} else if(valid_3byte($char)) { // check 2 bytes
if(!valid_nextbyte(ord(substr($string, $i++, 1))))
return false;
if(!valid_nextbyte(ord(substr($string, $i++, 1))))
return false;
} else if(valid_4byte($char)) { // check 3 bytes
if(!valid_nextbyte(ord(substr($string, $i++, 1))))
return false;
if(!valid_nextbyte(ord(substr($string, $i++, 1))))
return false;
if(!valid_nextbyte(ord(substr($string, $i++, 1))))
return false;
} // goto next char
}
return true; // done
}

for a drawing of the statemachine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png
up
0
d_maksimov
1 year ago
It was helpful for my exec(...) call. When it returned cp866 or cp1251:

try {
$line = iconv('CP866', 'CP1251', $line);
} catch(Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);
up
-1
telemach
18 years ago
Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')

returns UTF-8

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.
up
-1
lotushzy at gmail dot com
5 years ago
Regarding mb_detect_encoding: the page http://php.net/manual/zh/function.mb-detect-encoding.php gives this example:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
but the result is no longer false. Can you explain why? Thanks!
up
-5
Anonymous
9 years ago
// -----------------------------------------------------------

if(!function_exists('mb_detect_encoding')) {

function mb_detect_encoding($string, $enc=null, $ret=true) {
$out=$enc;
static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) { $out = ($ret !== false) ? true : $item; }
}
return $out;
}

}

// -----------------------------------------------------------
up
-4
yaqy at qq dot com
15 years ago
<?php
/*
 * QQ: 290359552
 * Convert to UTF-8 if $str is not detected as UTF-8
 */
function convToUtf8($str)
{
    if (mb_detect_encoding($str, "UTF-8, ISO-8859-1, GBK") != "UTF-8") {
        return iconv("gbk", "utf-8", $str);
    } else {
        return $str;
    }
}
?>
up
-6
matthijs at ischen dot nl
14 years ago
I seriously underestimated the importance of setlocale...
<?php
$strings = array(
    "mais coisas a pensar sobre diário ou dois!",
    "plus de choses à penser à journalier ou à deux !",
    "¡más cosas a pensar en diario o dos!",
    "più cose da pensare circa giornaliere o due!",
    "flere ting å tenke på hver dag eller to!",
    "Další věcí, přemýšlet o každý den nebo dva!",
    "mehr über Spaß spät schönen",
    "më vonë gjatë fun bukur",
    "több mint szórakozás késő csodálatos kenyér"
);

$convert = array();
setlocale(LC_CTYPE, 'de_DE.UTF-8');
foreach ($strings as $string) {
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
}
?>

Produces the following:

Array
(
[0] => mais coisas a pensar sobre diario ou dois!
[1] => plus de choses a penser a journalier ou a deux !
[2] => ?mas cosas a pensar en diario o dos!
[3] => piu cose da pensare circa giornaliere o due!
[4] => flere ting aa tenke paa hver dag eller to!
[5] => Dalsi veci, premyslet o kazdy den nebo dva!
[6] => mehr ueber Spass spaet schoenen
[7] => me vone gjate fun bukur
[8] => toebb mint szorakozas keso csodalatos kenyer
)

whereas

<?php
$convert = array();
setlocale(LC_CTYPE, 'nl_NL.UTF-8');
foreach ($strings as $string) {
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
}
?>

produces:
Array
(
[0] => mais coisas a pensar sobre di?rio ou dois!
[1] => plus de choses ? penser ? journalier ou ? deux !
[2] => ?m?s cosas a pensar en diario o dos!
[3] => pi? cose da pensare circa giornaliere o due!
[4] => flere ting ? tenke p? hver dag eller to!
[5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
[6] => mehr ?ber Spass sp?t sch?nen
[7] => m? von? gjat? fun bukur
[8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
)

This might be of interest when trying to convert UTF-8 strings into ASCII suitable for URLs and such. This was never obvious to me, since I've used locales for us and nl.
up
-5
jaaks at playtech dot com
18 years ago
The last example for verifying UTF-8 has one little bug. If a 10xxxxxx byte occurs alone, i.e. not within a multibyte character, it is accepted even though this is against UTF-8 rules. Make the following replacement to repair it.

Replace
} // goto next char
with
} else {
return false; // 10xxxxxx occurring alone
} // goto next char
up
-5
lexonight at yahoo dot com
6 years ago
<?php
$file = file_get_contents("somefile.txt");
$encodings = implode(',', mb_list_encodings());
echo mb_detect_encoding($file, $encodings, true);
?>
seems to work
up
-11
prgss at bk dot ru
14 years ago
Another light way to detect character encoding:
<?php
function detect_encoding($string) {
    static $list = array('utf-8', 'windows-1251');

    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string))
            return $item;
    }
    return null;
}
?>
up
-13
sunggsun
17 years ago
from PHPDIG

function isUTF8($str) {
if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
return true;
} else {
return false;
}
}