mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding - Detect character encoding

Description

function mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Detects the most likely character encoding for the string string from a list of candidates.

As of PHP 8.1, this function uses heuristics to determine which of the valid text encodings in the specified list is most likely to be correct, so the result may not follow the order in which the encodings are given in the encodings parameter.

Automatic detection of the intended character encoding can never be entirely reliable; without additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of the character encoding stored or transmitted with the data, such as a "Content-Type" HTTP header.
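As a rough illustration of preferring a declared encoding over guessing, the sketch below reads the charset parameter from a Content-Type value and converts the body with it. The helper name charset_from_content_type() and the sample header value are assumptions made for this example; they are not part of mbstring.

```php
<?php
// Hypothetical helper (not part of mbstring): extract the charset parameter
// from a Content-Type header value, or return null when none is declared.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Sample data, assumed for illustration only.
$header = 'text/html; charset=ISO-8859-1';
$body   = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1

$declared = charset_from_content_type($header);
$utf8 = null;

// Note: mb_check_encoding() throws a ValueError for encoding names mbstring
// does not know, so untrusted header values may need extra validation.
if ($declared !== null && mb_check_encoding($body, $declared)) {
    $utf8 = mb_convert_encoding($body, 'UTF-8', $declared);
}

var_dump($declared, bin2hex((string) $utf8));
```

Guessing with mb_detect_encoding() would then only be needed as a fallback when no charset is declared.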

This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding will be rejected.
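A minimal sketch of why multibyte encodings can be ruled out while single-byte ones usually cannot. It uses mb_check_encoding() rather than mb_detect_encoding() so that the validity check is visible on its own; the sample bytes are chosen for illustration.

```php
<?php
$bytes = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1

// 0xE1 announces a three-byte UTF-8 sequence, but 0xE9 is not a
// continuation byte, so the string is not valid UTF-8.
$validUtf8 = mb_check_encoding($bytes, 'UTF-8');
var_dump($validUtf8); // bool(false)

// Every byte value maps to a character in ISO-8859-1, so this
// encoding can never be rejected on validity grounds.
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');
var_dump($validLatin1); // bool(true)
```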

Warning

The result is not accurate

The name of this function is misleading: it performs "guessing" rather than "detection".

The guesses are far from accurate, and therefore this function cannot be relied upon to identify the correct character encoding.

Parameters

string

The string being inspected.

encodings

A list of character encodings to try. The list may be specified as an array of strings or as a single comma-separated string.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option or the mb_detect_order() function) will be used.
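The two ways of passing the candidate list, and the fallback to the current detect order, might be exercised like this. The sample text is an assumption for illustration; the results of the last two calls depend on the local configuration, so no output is shown for them.

```php
<?php
$text = "plain ASCII text";

// The same candidate list, once as a comma-separated string and once as an array.
$fromString = mb_detect_encoding($text, "ASCII, UTF-8, ISO-8859-1");
$fromArray  = mb_detect_encoding($text, ["ASCII", "UTF-8", "ISO-8859-1"]);
var_dump($fromString === $fromArray); // bool(true)

// With $encodings omitted, the candidates come from the current detect order.
$order = mb_detect_order();
print_r($order);
var_dump(mb_detect_encoding($text));
```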

strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.

The default value for strict can be set with the mbstring.strict_detection configuration option.

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.
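Because the function returns either a string or false, callers generally need a strict comparison before using the result. The wrapper below is a sketch only; the name to_utf8_or_null() is hypothetical and chosen for this example.

```php
<?php
// Hypothetical wrapper: convert to UTF-8 only when strict detection succeeds.
function to_utf8_or_null(string $bytes, array $candidates): ?string
{
    $encoding = mb_detect_encoding($bytes, $candidates, true);
    if ($encoding === false) {
        return null; // not valid in any of the candidate encodings
    }
    return mb_convert_encoding($bytes, 'UTF-8', $encoding);
}

// 'áéóú' in ISO-8859-1 is valid in neither ASCII nor UTF-8.
var_dump(to_utf8_or_null("\xE1\xE9\xF3\xFA", ['ASCII', 'UTF-8'])); // NULL

// Plain ASCII passes through unchanged.
var_dump(to_utf8_or_null("abc", ['ASCII'])); // string(3) "abc"
```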

Changelog

Version	Description
8.2.0	mb_detect_encoding() will no longer return the following non-text encodings: `"Base64"`, `"QPrint"`, `"UUencode"`, `"HTML entities"`, `"7 bit"` and `"8 bit"`.

Examples

Example #1 mb_detect_encoding() example

<?php

$str = "\x95\xB6\x8E\x9A\x83\x52\x81\x5B\x83\x68";

// Detect character encoding with the current detect_order
var_dump(mb_detect_encoding($str));

// "auto" is expanded according to mbstring.language
var_dump(mb_detect_encoding($str, "auto"));

// Specify the "encodings" parameter as a comma-separated list
var_dump(mb_detect_encoding($str, "JIS, eucjp-win, sjis-win"));

// Use an array to specify the "encodings" parameter
$encodings = [
  "ASCII",
  "JIS",
  "EUC-JP"
];
var_dump(mb_detect_encoding($str, $encodings));
?>

The above example will output:

string(5) "ASCII"
string(5) "ASCII"
string(8) "SJIS-win"
string(5) "ASCII"

Example #2 Effect of the strict parameter

<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

In some cases, the same sequence of bytes may form a valid string in multiple character encodings, and it is impossible to know which interpretation was intended. For example, among many others, the byte sequence "\xC4\xA2" could be:

"Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
"ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
"Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8

Example #3 Effect of order when multiple encodings match

<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, but the first one listed may not always be the one returned
var_dump(mb_detect_encoding($str, ['UTF-8']));
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5'])); // as of PHP 8.1 this returns ISO-8859-1 instead of UTF-8
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"
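The competing readings of the byte sequence "\xC4\xA2" can be made visible by decoding it under each candidate and inspecting the result. This is only a sketch; the $decoded array is a name introduced for this example.

```php
<?php
$bytes = "\xC4\xA2";
$decoded = [];

foreach (['ISO-8859-1', 'ISO-8859-5', 'UTF-8'] as $encoding) {
    // The sequence is valid in every one of these encodings...
    $valid = mb_check_encoding($bytes, $encoding);

    // ...but it decodes to different text in each.
    $decoded[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);

    printf(
        "%-10s valid=%s  text=%s  UTF-8 bytes=%s\n",
        $encoding,
        var_export($valid, true),
        $decoded[$encoding],
        bin2hex($decoded[$encoding])
    );
}
```

Since all three checks succeed, validity alone cannot pick a winner; any choice between them is a guess.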

See Also

mb_detect_order() - Set/Get character encoding detection order

User Contributed Notes 19 notes

Gerg Tisza ¶

15 years ago

If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; it is pretty worthless otherwise.

<?php
    $str = 'áéóú'; // ISO-8859-1
    mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
    mb_detect_encoding($str, 'UTF-8', true); // false
?>

mta59066 at gmail dot com ¶

3 years ago

The documentation is no longer correct for php8.1 and mb_detect_encoding no longer supports order of encodings. The example outputs given in the documentation are also no longer correct for php8.1. This is somewhat explained here https://github.com/php/php-src/issues/8279

I understand the previous ambiguity in these functions, but in my opinion 8.1 should have deprecated mb_detect_encoding and mb_detect_order and come up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.

Below is an example function that will do what mb_detect_encoding was doing prior to the 8.1 change.

<?php

// Returns the first encoding in $encodings in which $string is valid, or false if none matches.
function mb_detect_encoding_in_order(string $string, array $encodings): string|false
{
    foreach($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}

?>

geompse at gmail dot com ¶

3 years ago

Major undocumented breaking change since 8.1.7
https://3v4l.org/BLjZ3

Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding

Chrigu ¶

21 years ago

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

chris AT w3style.co DOT uk ¶

19 years ago

Based upon that snippet below using preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters so that I could switch character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
        return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
        )+%xs', $string);
}

?>

nat3738 at gmail dot com ¶

17 years ago

A simple way to detect whether a file is UTF-8/16/32 by its BOM (does not work with a string or file that has no BOM)

<?php
// Unicode BOM is U+FEFF, but after encoded, it will look like this.
define ('UTF32_BIG_ENDIAN_BOM'   , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM'   , chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define ('UTF8_BOM'               , chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {

    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // UTF-32 BOMs are four bytes long

    // UTF-32 is checked before UTF-16 because the UTF-32LE BOM starts with the UTF-16LE BOM.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';

    return null; // no BOM found
}
?>

dennis at nikolaenko dot ru ¶

17 years ago

Beware of a bug when detecting Russian encodings:
http://bugs.php.net/bug.php?id=38138

rl at itfigures dot nl ¶

18 years ago

I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to 8859-1, which works fine. I did have a problem with the subsequent iconv conversion.

The problem is that the iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice to use \x80 as the euro sign in text labelled 8859-1 (strictly speaking, that byte is the euro sign in Windows-1252; ISO-8859-1 itself does not define one).

I could not use 8859-15 since that mangled some other characters, so I added two str_replace calls:

if(detectUTF8($str)){
  $str=str_replace("\xE2\x82\xAC","&euro;",$str); 
  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);
  $str=str_replace("&euro;","\x80",$str); 
}

If html-output is needed the last line is not necessary (and even unwanted).

eyecatchup at gmail dot com ¶

13 years ago

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }
?>

hmdker at gmail dot com ¶

17 years ago

A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.

<?php
function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c >= 128){ // any byte with the high bit set, including 0x80
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>

php-note-2005 at ryandesign dot com ¶

21 years ago

Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
    
} // function is_utf8

?>

garbage at iglou dot eu ¶

9 years ago

To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori

d_maksimov ¶

4 years ago

It was helpful for my exec(...) call. When it returned cp866 or cp1251:

try {
    $line = iconv('CP866', 'CP1251', $line);
} catch(Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);

emoebel at web dot de ¶

12 years ago

If the function "mb_detect_encoding" does not exist ...

... try:

<?php 
// ---------------------------------------------------- 
if ( !function_exists('mb_detect_encoding') ) { 

// ---------------------------------------------------------------- 
function mb_detect_encoding ($string, $enc=null, $ret=null) { 
       
        static $enclist = array( 
            'UTF-8', 'ASCII', 
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16', 
            'Windows-1251', 'Windows-1252', 'Windows-1254', 
            );
        
        $result = false; 
        
        foreach ($enclist as $item) { 
            $sample = iconv($item, $item, $string); 
            if (md5($sample) == md5($string)) { 
                if ($ret === NULL) { $result = $item; } else { $result = true; } 
                break; 
            }
        }
        
    return $result; 
} 
// ---------------------------------------------------------------- 

} 
// ---------------------------------------------------- 
?>

example / usage of: mb_detect_encoding() 

<?php 
// ------------------------------------------------------ 
function str_to_utf8 ($str) { 
    
    if (mb_detect_encoding($str, 'UTF-8', true) === false) { 
    $str = utf8_encode($str); 
    }

    return $str;
}
// ------------------------------------------------------ 
?>

$txtstr = str_to_utf8($txtstr);

maarten ¶

21 years ago

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some iso-8859-1 encoded text as utf-8.
To verify UTF-8, use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
    
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
    
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
    
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;    
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else {
                return false; // invalid lead byte (e.g. a stray continuation byte)
            } // goto next char
        }
        return true; // done
    }

for a drawing of the statemachine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

bmrkbyet at web dot de ¶

13 years ago

a) if the FUNCTION mb_detect_encoding is not available: 

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_detect_encoding')) { 
function mb_detect_encoding($string, $enc=null) { 
    
    static $list = array('utf-8', 'iso-8859-1', 'windows-1251');
    
    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string)) { 
            if ($enc == $item) { return true; }    else { return $item; } 
        }
    }
    return null;
}
}

// -------------------------------------------
?>

b) if the FUNCTION mb_convert_encoding is not available: 

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_convert_encoding')) { 
function mb_convert_encoding($string, $target_encoding, $source_encoding) { 
    $string = iconv($source_encoding, $target_encoding, $string); 
    return $string; 
}
}

// -------------------------------------------
?>

telemach ¶

21 years ago

Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentu?e' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while 

mb_detect_encoding('accentu?' , 'UTF-8, ISO-8859-1')

returns UTF-8

Bottom line: an ending '?' (and probably other accented chars) misleads mb_detect_encoding.

recentUser at example dot com ¶

8 years ago

In my environment (PHP 7.1.12),
"mb_detect_encoding()" doesn't work
     where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case,
     simply put "mb_detect_order('...')"
     before "mb_detect_encoding()" in your script file.

Both 
     "ini_set('mbstring.language', '...');"
     and
     "ini_set('mbstring.detect_order', '...');"
DON'T work in script files for this purpose
whereas setting them in PHP.INI file may work.

lotushzy at gmail dot com ¶

8 years ago

About the function mb_detect_encoding: the page at http://php.net/manual/zh/function.mb-detect-encoding.php shows this:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
But now the result is not false. Can you give me the reason? Thanks!
