A simple mb_str_ireplace() implementation - a faster (?) replacement for non-regexp multi-byte string replacement:
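<?php
// $co = search string, $naCo = replacement string, $wCzym = subject string
function mb_str_ireplace($co, $naCo, $wCzym)
{
    // Compute the lengths once instead of on every loop iteration.
    $coLen = mb_strlen($co);
    $naCoLen = mb_strlen($naCo);
    if ($coLen === 0) {
        return $wCzym; // nothing to search for; an empty needle would never terminate on PHP 8
    }
    $coM = mb_strtolower($co);
    $wCzymM = mb_strtolower($wCzym);
    $offset = 0;
    while (($poz = mb_strpos($wCzymM, $coM, $offset)) !== false)
    {
        // Continue searching after the inserted replacement so it is never re-matched.
        $offset = $poz + $naCoLen;
        $wCzym = mb_substr($wCzym, 0, $poz) . $naCo . mb_substr($wCzym, $poz + $coLen);
        $wCzymM = mb_strtolower($wCzym);
    }
    return $wCzym;
}
?>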
[thiago - EDITOR NOTE: This function has improvements from d-okumura [aat] fi{dot}kyd[dot]co.jp]
mb_ereg_replace
(PHP 4 >= 4.2.0, PHP 5)
mb_ereg_replace - Replace regular expression with multibyte support
Description
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
Scans string for matches to pattern, then replaces the matched text with replacement.
Parameters
- pattern - The regular expression pattern. Multibyte characters may be used in pattern.
- replacement - The replacement text.
- string - The string being checked.
- option - Matching conditions can be set with the option parameter. If i is specified, case is ignored. If x is specified, white space is ignored. If m is specified, the match is executed in multiline mode and line breaks are included in '.'. If p is specified, the match is executed in POSIX mode and line breaks are treated as normal characters. If e is specified, the replacement string is evaluated as a PHP expression.
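As a small illustration of the option parameter, the sketch below compares the default behaviour with the i and x options. The sample strings and variable names are made up for this example, and it sticks to an ASCII pattern because case-insensitive matching of multibyte letters may behave differently depending on the PHP version.

```php
<?php
mb_regex_encoding('UTF-8');

// Default options: matching is case-sensitive, so only the lower-case "php" is replaced.
$default = mb_ereg_replace('php', 'PHP', 'Php and php');

// 'i' ignores case, so both occurrences are replaced.
$ignoreCase = mb_ereg_replace('php', 'PHP', 'Php and php', 'i');

// 'x' additionally ignores white space inside the pattern itself.
$extended = mb_ereg_replace('p h p', 'PHP', 'Php and php', 'ix');

echo $default, "\n", $ignoreCase, "\n", $extended, "\n";
```

Note that passing an option string replaces the default "msr" rather than adding to it.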
Return Values
The resulting string on success, or FALSE on error.
Notes
Note:
The internal encoding or the character encoding specified by mb_regex_encoding() will be used as the character encoding for this function.
Never use the e modifier when working with untrusted input. No automatic escaping takes place (as it does in preg_replace()). Using the modifier can create remote code execution vulnerabilities in your application.
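Where the replacement has to be computed, a callback avoids evaluating a code string altogether. The following is only a sketch, assuming PHP 5.4.1 or later, where mb_ereg_replace_callback() is available; the sample text is invented for illustration.

```php
<?php
mb_regex_encoding('UTF-8');

// Double every number in the string. The callback receives the match array,
// so no part of the input is ever executed as PHP code.
$result = mb_ereg_replace_callback(
    '[0-9]+',
    function (array $matches) {
        return (string) ($matches[0] * 2);
    },
    '価格: 120円, 送料: 35円'
);

echo $result, "\n";
```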
See Also
- mb_regex_encoding() - Set/Get character encoding for multibyte regex
- mb_eregi_replace() - Replace regular expression with multibyte support ignoring case
You can use \\n to refer to a capture group in the replacement string.
You can NOT use the $n notation (unlike in preg_replace()).
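A quick way to see the difference between the two notations (a minimal sketch; the sample strings are arbitrary):

```php
<?php
mb_regex_encoding('UTF-8');

// '\\2 \\1' reaches the function as \2 \1, which refer to the two capture groups.
$swapped = mb_ereg_replace('([a-z]+) ([a-z]+)', '\\2 \\1', 'hello world');

// $1 has no special meaning here, so it is inserted literally.
$literal = mb_ereg_replace('([a-z]+)', '$1!', 'hi');

echo $swapped, "\n", $literal, "\n";
```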
Unlike preg_replace(), mb_ereg_replace() doesn't use delimiters around the pattern.
Example with preg_replace():
<?php $data = preg_replace("/[^A-Za-z0-9\.\-]/","",$data); ?>
Example with mb_ereg_replace():
<?php $data = mb_ereg_replace("[^A-Za-z0-9\.\-]","",$data); ?>
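One consequence of having no delimiters is that a slash in the pattern is just an ordinary character and needs no escaping. A small sketch with a made-up path string:

```php
<?php
mb_regex_encoding('UTF-8');

// Collapse runs of slashes into one; '/' is a plain literal in the pattern.
$path = mb_ereg_replace('/+', '/', 'a//b///c');

echo $path, "\n";
```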
I ran into a pretty nasty error while trying to parse table rows (all contents were set to UTF-8) from the database for a dictionary project. The idea was to get all the rows from the first table (a table with a Bulgarian phrase in the first field and its translations into English, French and German in the following fields). I needed to index all the Bulgarian words found in the table to build an intelligent search. And that is where my headache started.
First of all, even with mb_strtolower() a lot of Cyrillic characters got corrupted (e.g. 'т', 'ъ', 'у', 'ф', 'б', 'г', 'з', 'ж', etc.). After an hour of different attempts I arrived at this solution:
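<?php
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");

// $db and $commonWords (an array of words to skip) are assumed to be defined elsewhere.
$rows = $db->getRows();
$contents = array();
foreach ($rows as $eachRow)
{
    // Lower-case the Bulgarian phrase as UTF-8, then blank out the common words.
    $cleared = str_replace($commonWords, ' ', mb_strtolower(stripslashes($eachRow['bulgarian']), 'UTF-8'));
    if (trim($cleared) != '') $contents[] = trim($cleared);
}
$list = array();
foreach ($contents as $eachRow)
{
    $exploded = explode(' ', $eachRow);
    foreach ($exploded as $eachExpl)
    {
        // Replace everything except lower-case Cyrillic letters and spaces.
        $eachExpl = trim(mb_ereg_replace('[^а-я ]', ' ', $eachExpl));
        // Trim before the duplicate check so padded and unpadded copies count as the same word.
        if ($eachExpl != '' && !in_array($eachExpl, $list, true)) $list[] = $eachExpl;
    }
}
?>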
For this to work properly I had to set all the internal encoding settings to UTF-8. Otherwise the default Latin-1 left half my database with missing characters.
I am posting this solution in case someone runs into a similar problem. I hope it helps.
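For anyone who wants to try the idea without a database, here is a self-contained sketch of the same approach. The function name index_cyrillic_words() and the sample phrases are made up for this example, and the source file is assumed to be saved as UTF-8.

```php
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Hypothetical helper: collects the unique lower-case Cyrillic words from a list of phrases.
function index_cyrillic_words(array $phrases): array
{
    $list = array();
    foreach ($phrases as $phrase) {
        // Lower-case explicitly as UTF-8 so Cyrillic letters are not corrupted.
        $lower = mb_strtolower($phrase, 'UTF-8');
        // Turn everything that is not a lower-case Cyrillic letter or a space into a space.
        $clean = mb_ereg_replace('[^а-я ]', ' ', $lower);
        foreach (explode(' ', $clean) as $word) {
            if ($word !== '' && !in_array($word, $list, true)) {
                $list[] = $word;
            }
        }
    }
    return $list;
}

$words = index_cyrillic_words(array('Добър ден, World!', 'ДЕН и нощ'));
print_r($words);
```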
<?php
$pattern = "([あ-ん]+)[0-9]+";
$string = mb_ereg_replace($pattern, '「\\1」:\\0', $string);
?>
You can use \\n to refer to a capture group in the replacement string (\\0 is the whole match).
If you want to replace characters like "ä" or "ø" you can use mb_ereg_replace, but it is very slow. str_replace is much faster and also works with characters like "ä" or "ø"!
I think this has something to do with the fact that str_replace works at the byte level and does not care about characters.
I hope that helps.
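A minimal sketch of the byte-level approach, assuming both the source file and the input are UTF-8. It is safe in UTF-8 because the byte sequence of one character never occurs inside the encoding of another; that does not hold for some other multibyte encodings such as Shift_JIS.

```php
<?php
// The search strings and the subject are all UTF-8 byte sequences,
// so a plain byte-wise comparison finds exactly the intended characters.
$out = str_replace(array('ä', 'ø'), array('ae', 'oe'), 'Bär und Smørrebrød');

echo $out, "\n";
```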
The 'i' option does not work correctly with multibyte characters. The function fails to find and replace a multibyte string when its case differs from the case used in the multibyte pattern.
Well, if you calculated the lengths of the search and replacement strings just once instead of on every loop iteration, it would likely speed it up a lot.
Regarding the mb_str_ireplace() function: I benchmarked it against mb_eregi_replace() for single-character substitution, and it was significantly slower. Despite avoiding the ereg call, I think the while loop ends up slowing it down too much for this to be practical.
Are you looking for htmlentities() for multibyte strings? This might help you. It only replaces <, >, " and '.
<?php
/**
 * Multibyte equivalent for htmlentities() [lite version :)]
 *
 * @param string $str
 * @param string $encoding
 * @return string
 **/
function mb_htmlentities($str, $encoding = 'utf-8') {
    mb_regex_encoding($encoding);
    $pattern = array('<', '>', '"', '\'');
    // Each character above is replaced by the HTML entity at the same index.
    $replacement = array('&lt;', '&gt;', '&quot;', '&#39;');
    for ($i=0; $i<sizeof($pattern); $i++) {
        $str = mb_ereg_replace($pattern[$i], $replacement[$i], $str);
    }
    return $str;
}
?>
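For comparison, the built-in htmlspecialchars() covers the same characters when given ENT_QUOTES and an explicit encoding, and it also escapes &, which the function above leaves untouched. A short sketch with an invented sample string:

```php
<?php
// ENT_QUOTES escapes both double and single quotes; 'UTF-8' names the input encoding.
$escaped = htmlspecialchars('<b>"ä" & \'ø\'</b>', ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
```

The multibyte characters themselves pass through unchanged.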
