mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding — Определяет кодировку символов

Описание

function mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Функция определяет наиболее вероятную кодировку символов значения с типом string в параметре string путём проверки списка кандидатов по порядку.

Начиная с PHP 8.1 функция возвращает не первую возможную кодировку, а проверяет каждую допустимую кодировку в списке encodings и эвристически определяет наиболее правильную.

Надёжность автоматического определения предполагаемой кодировки символов не достигает 100 %; без дополнительной информации это похоже на расшифровку зашифрованной строки без ключа. Лучше явно указать кодировку символов, которая хранится или передаётся с данными, например в HTTP-заголовке Content-Type.

Функция полезнее при вызове с многобайтовыми кодировками, поскольку не каждая последовательность байтов образует допустимую строку. Функция отклонит кодировку, если входная строка содержит такую последовательность.

Внимание

Неточность результата

Название функции вводит в заблуждение: функция «угадывает» кодировку, а не «обнаруживает».

Догадки неточны, поэтому функцией невозможно точно определить правильную кодировку символов.

Список параметров

string

Проверяемая строка (string).

encodings

Список кодировок символов для проверки. Список определяется как массив строк или как строка со списком кодировок через запятую.

При пропуске параметра encodings или установке для параметра значения null выбирается текущий порядок определения кодировки, который установили в директиве mbstring.detect_order настроек конфигурации или функцией mb_detect_order().

strict

Управляет поведением, когда строка в параметре string недопустима ни для одной перечисленной в параметре encodings кодировки. При передаче в параметр strict значения false возвращается первая совпавшая кодировка; при установке для параметра strict значения true возвращается значение false.

Значение по умолчанию для параметра strict также устанавливается в директиве mbstring.strict_detection настроек конфигурации.

Возвращаемые значения

Функция возвращает кодировку символов, которую обнаружила, или false, если строка недопустима ни для одной из перечисленных кодировок.

Список изменений

Версия	Описание
8.2.0	Функция mb_detect_encoding() больше не возвращает следующие нетекстовые кодировки: `"Base64"`, `"QPrint"`, `"UUencode"`, `"HTML entities"`, `"7 bit"` и `"8 bit"`.

Примеры

Пример #1 Пример определения кодировки функцией mb_detect_encoding()

<?php

$str = "\x95\xB6\x8E\x9A\x83\x52\x81\x5B\x83\x68";

// Определение кодировки символов с текущим порядком определения
var_dump(mb_detect_encoding($str));

// Значение "auto" раскрывается в соответствии с директивой mbstring.language
var_dump(mb_detect_encoding($str, "auto"));

// Установка параметра "encodings" списком значений через запятую
var_dump(mb_detect_encoding($str, "JIS, eucjp-win, sjis-win"));

// Установка параметра "encodings" массивом
$encodings = [
  "ASCII",
  "JIS",
  "EUC-JP"
];
var_dump(mb_detect_encoding($str, $encodings));

Результат выполнения приведённого примера:

string(5) "ASCII"
string(5) "ASCII"
string(8) "SJIS-win"
string(5) "ASCII"

Пример #2 Действие параметра strict

<?php

// Строка "áéóú" в кодировке ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// Строка недопустима для кодировок ASCII или UTF-8, но UTF-8 считается более близким соответствием
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// При обнаружении допустимой кодировки параметр strict не изменяет результат
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));

Результат выполнения приведённого примера:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

Иногда одна и та же последовательность байтов образовывает допустимую строку в нескольких кодировках символов, и невозможно узнать, какая интерпретация подразумевалась. Например, среди многих других байтовая последовательность "\xC4\xA2" допустима для:

«Ä¢» (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS с последующим U+00A2 CENT SIGN), закодированная в одной из кодировок — ISO-8859-1, ISO-8859-15 или Windows-1252
«ФЂ» (U+0424 CYRILLIC CAPITAL LETTER EF с последующим U+0402 CYRILLIC CAPITAL LETTER DJE), закодированная в ISO-8859-5
«Ģ» (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA), закодированная в UTF-8

Пример #3 Действие порядка кодировок при совпадении нескольких кандидатов

<?php

$str = "\xC4\xA2";

// Строка действительна в каждой из трёх кодировок,
// но вернётся не первая возможная, а наиболее корректная
var_dump(mb_detect_encoding($str, ['UTF-8']));
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5'])); // С php8.1 вернётся кодировка ISO-8859-1, а не UTF-8
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));

?>

Результат выполнения приведённого примера:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

Смотрите также

mb_detect_order() - Устанавливает или получает порядок определения кодировки символов

Нашли ошибку?

Инструкция • Исправление • Сообщение об ошибке

＋Добавить

Примечания пользователей 19 notes

down

Gerg Tisza ¶

15 years ago

If you try to use mb_detect_encoding to detect whether a string is valid UTF-8, use the strict mode, it is pretty worthless otherwise.

<?php
    $str = 'áéóú'; // ISO-8859-1
    mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
    mb_detect_encoding($str, 'UTF-8', true); // false
?>

down

mta59066 at gmail dot com ¶

3 years ago

The documentation is no longer correct for php8.1 and mb_detect_encoding no longer supports order of encodings. The example outputs given in the documentation are also no longer correct for php8.1. This is somewhat explained here https://github.com/php/php-src/issues/8279

I understand the previous ambiguity in these functions, but in my option 8.1 should have deprecated mb_detect_encoding and mb_detect_order and came up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.

Below is an example function that will do what mb_detect_encoding was doing prior to the 8.1 change.

<?php

function mb_detect_enconding_in_order(string $string, array $encodings): string|false
{
    foreach($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}

?>

down

geompse at gmail dot com ¶

3 years ago

Major undocumented breaking change since 8.1.7
https://3v4l.org/BLjZ3

Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding

down

Chrigu ¶

21 years ago

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

down

chris AT w3style.co DOT uk ¶

19 years ago

Based upon that snippet below using preg_match() I needed something faster and less specific.  That function works and is brilliant but it scans the entire strings and checks that it conforms to UTF-8.  I wanted something purely to check if a string contains UTF-8 characters so that I could switch character encoding from iso-8859-1 to utf-8.

I modified the pattern to only look for non-ascii multibyte sequences in the UTF-8 range and also to stop once it finds at least one multibytes string.  This is quite a lot faster.

<?php

function detectUTF8($string)
{
        return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
        )+%xs', $string);
}

?>

down

nat3738 at gmail dot com ¶

17 years ago

A simple way to detect UTF-8/16/32 of file by its BOM (not work with string or file without BOM)

<?php
// Unicode BOM is U+FEFF, but after encoded, it will look like this.
define ('UTF32_BIG_ENDIAN_BOM'   , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM'   , chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define ('UTF8_BOM'               , chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {

    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 3);
    
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
?>

down

dennis at nikolaenko dot ru ¶

17 years ago

Beware of bug to detect Russian encodings
http://bugs.php.net/bug.php?id=38138

down

rl at itfigures dot nl ¶

18 years ago

I used Chris's function "detectUTF8" to detect the need from conversion from utf8 to 8859-1, which works fine. I did have a problem with the following iconv-conversion.

The problem is that the iconv-conversion to 8859-1 (with //TRANSLIT) replaces the euro-sign with EUR, although it is common practice  that \x80 is used as the euro-sign in the 8859-1 charset. 

I could not use 8859-15 since that mangled some other characters, so I added 2 str_replace's:

if(detectUTF8($str)){
  $str=str_replace("\xE2\x82\xAC","&euro;",$str); 
  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);
  $str=str_replace("&euro;","\x80",$str); 
}

If html-output is needed the last line is not necessary (and even unwanted).

down

eyecatchup at gmail dot com ¶

13 years ago

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }

down

hmdker at gmail dot com ¶

17 years ago

Function to detect UTF-8, when mb_detect_encoding is not available it may be useful.

<?php
function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c > 128){
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>

down

php-note-2005 at ryandesign dot com ¶

21 years ago

Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
    
} // function is_utf8

?>

down

garbage at iglou dot eu ¶

9 years ago

For detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori

down

-2

d_maksimov ¶

4 years ago

It was helpful for my exec(...) call. When it returned cp866 or cp1251:

try {
    $line = iconv('CP866', 'CP1251', $line);
} catch(Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);

down

emoebel at web dot de ¶

12 years ago

if the  function " mb_detect_encoding" does not exist  ... 

... try: 

<?php 
// ---------------------------------------------------- 
if ( !function_exists('mb_detect_encoding') ) { 

// ---------------------------------------------------------------- 
function mb_detect_encoding ($string, $enc=null, $ret=null) { 
       
        static $enclist = array( 
            'UTF-8', 'ASCII', 
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16', 
            'Windows-1251', 'Windows-1252', 'Windows-1254', 
            );
        
        $result = false; 
        
        foreach ($enclist as $item) { 
            $sample = iconv($item, $item, $string); 
            if (md5($sample) == md5($string)) { 
                if ($ret === NULL) { $result = $item; } else { $result = true; } 
                break; 
            }
        }
        
    return $result; 
} 
// ---------------------------------------------------------------- 

} 
// ---------------------------------------------------- 
?>

example / usage of: mb_detect_encoding() 

<?php 
// ------------------------------------------------------ 
function str_to_utf8 ($str) { 
    
    if (mb_detect_encoding($str, 'UTF-8', true) === false) { 
    $str = utf8_encode($str); 
    }

    return $str;
}
// ------------------------------------------------------ 
?>

$txtstr = str_to_utf8($txtstr);

down

maarten ¶

21 years ago

Sometimes mb_detect_string is not what you need. When using pdflib for example you want to VERIFY the correctness of utf-8. mb_detect_encoding reports some iso-8859-1 encoded text as utf-8.
To verify utf 8 use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
    
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
    
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
    
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;    
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }

for a drawing of the statemachine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

down

-1

bmrkbyet at web dot de ¶

13 years ago

a) if the FUNCTION mb_detect_encoding is not available: 

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_detect_encoding')) { 
function mb_detect_encoding($string, $enc=null) { 
    
    static $list = array('utf-8', 'iso-8859-1', 'windows-1251');
    
    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string)) { 
            if ($enc == $item) { return true; }    else { return $item; } 
        }
    }
    return null;
}
}

// -------------------------------------------
?>

b) if the FUNCTION mb_convert_encoding is not available: 

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_convert_encoding')) { 
function mb_convert_encoding($string, $target_encoding, $source_encoding) { 
    $string = iconv($source_encoding, $target_encoding, $string); 
    return $string; 
}
}

// -------------------------------------------
?>

down

-1

telemach ¶

21 years ago

beware : even if you need to distinguish between UTF-8 and ISO-8859-1, and you the following detection order (as chrigu suggests)

mb_detect_encoding('accentu?e' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while 

mb_detect_encoding('accentu?' , 'UTF-8, ISO-8859-1')

returns UTF-8

bottom line : an ending '?' (and probably other accentuated chars) mislead mb_detect_encoding

down

-1

recentUser at example dot com ¶

8 years ago

In my environment (PHP 7.1.12),
"mb_detect_encoding()" doesn't work
     where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case,
     simply put "mb_detect_order('...')"
     before "mb_detect_encoding()" in your script file.

Both 
     "ini_set('mbstring.language', '...');"
     and
     "ini_set('mbstring.detect_order', '...');"
DON'T work in script files for this purpose
whereas setting them in PHP.INI file may work.

down

-3

lotushzy at gmail dot com ¶

8 years ago

About function mb_detect_encoding, the link http://php.net/manual/zh/function.mb-detect-encoding.php , like this:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
but now the result is not false, can you give me reason, thanks!

＋Добавить