grapheme_extract

(PHP 5 >= 5.3.0, PHP 7, PHP 8, PECL intl >= 1.0.0)

grapheme_extract - Extracts a sequence of default grapheme clusters from a UTF-8 encoded text buffer

Description

Procedural style

function grapheme_extract(
    string $haystack,
    int $size,
    int $type = GRAPHEME_EXTR_COUNT,
    int $offset = 0,
    int &$next = null
): string|false

Extracts a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8.

Parameters

haystack

The string to search.

size

Maximum number of items to return, counted in the unit selected by type.

type

Defines the unit that size counts:

GRAPHEME_EXTR_COUNT (default) - size is the number of default grapheme clusters to extract.
GRAPHEME_EXTR_MAXBYTES - size is the maximum number of bytes returned.
GRAPHEME_EXTR_MAXCHARS - size is the maximum number of UTF-8 characters returned.
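
The three GRAPHEME_EXTR_* modes can be compared on one input. The following is a minimal sketch, not part of the original page; the sample string and variable names are chosen for illustration only, and it assumes the intl extension is loaded.

```php
<?php
// Two grapheme clusters in normalization form D:
// "a" + COMBINING RING ABOVE and "o" + COMBINING DIAERESIS.
// That is 6 bytes, 4 code points, 2 grapheme clusters.
$text = "a\xCC\x8A" . "o\xCC\x88";

// The limit counts grapheme clusters: take at most one cluster.
$by_count = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT);

// The limit counts bytes: 5 bytes cannot hold both 3-byte clusters,
// so the result stops at the last cluster boundary that still fits.
$by_bytes = grapheme_extract($text, 5, GRAPHEME_EXTR_MAXBYTES);

// The limit counts code points: 3 cannot hold both 2-code-point clusters.
$by_chars = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXCHARS);

foreach (['COUNT' => $by_count, 'MAXBYTES' => $by_bytes, 'MAXCHARS' => $by_chars] as $mode => $result) {
    echo $mode, ': ', urlencode($result), PHP_EOL;
}
```

With this input, each of the three calls should return only the first cluster (a%CC%8A), because a cluster is never split to fill the remaining bytes or characters.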

offset

Starting position in haystack, in bytes. If given, it must be zero or a positive value less than or equal to the length of haystack in bytes. Negative values count from the end of haystack. If offset does not point to the first byte of a UTF-8 character, the start position is moved forward to the first byte of the next character.

next

Reference to a variable that receives the next starting position. When the call returns, it holds the position of the first byte after the last character of the returned string.
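
The byte offset and the by-reference next position are meant to be used together. The sketch below is illustrative rather than taken from the original page; the variable names are assumptions, and the negative offset requires PHP 7.1.0 or later.

```php
<?php
// Two 3-byte grapheme clusters in normalization form D.
$text = "a\xCC\x8A" . "o\xCC\x88";

// Start at byte 0; $next receives the byte position just past the result.
$next = 0;
$first = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 0, $next);
$next_after_first = $next;

// Feed that position back in as the offset to get the following cluster.
$second = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $next, $next);
$next_after_second = $next;

// A negative offset counts back from the end of the string (in bytes).
$last = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, -3);

echo urlencode($first), ' next=', $next_after_first, PHP_EOL;
echo urlencode($second), ' next=', $next_after_second, PHP_EOL;
echo urlencode($last), PHP_EOL;
```

After the second call the next position equals strlen($text), which is the usual signal that the whole string has been consumed.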

Return Values

Returns a string that starts at the given offset and contains the default grapheme clusters selected by size and type, or false on failure.

Changelog

Version	Description
7.1.0	`offset` may now be negative.

Examples

Example 1 - grapheme_extract() example

<?php
$char_a_ring_nfd = "a\xCC\x8A";      // 'å' (U+00E5) in normalization form "D"
$char_o_diaeresis_nfd = "o\xCC\x88"; // 'ö' (U+00F6) in normalization form "D"

print urlencode(grapheme_extract( $char_a_ring_nfd . $char_o_diaeresis_nfd, 1,
                                  GRAPHEME_EXTR_COUNT, 2));

?>

The above example will output:

o%CC%88

See Also

grapheme_substr() - Return part of a string
» Unicode Text Segmentation: Grapheme Cluster Boundaries

User Contributed Notes 3 notes

AJH ¶

15 years ago

Here's how to use grapheme_extract() to loop across a UTF-8 string character by character.

<?php

$str = "سabcक’…";
// if the previous line didn't come through, the string contained:
//U+0633,U+0061,U+0062,U+0063,U+0915,U+2019,U+2026

$n = 0;

for (    $start = 0, $next = 0, $maxbytes = strlen($str), $c = '';
        $start < $maxbytes;
        $c = grapheme_extract($str, 1, GRAPHEME_EXTR_MAXCHARS , ($start = $next), $next)
    )
{
    if (empty($c))
        continue;
    echo "This utf8 character is " . strlen($c) . " bytes long and its first byte is " . ord($c[0]) . "\n";
    $n++;
}
echo "$n UTF-8 characters in a string of $maxbytes bytes!\n";
// Should print: 7 UTF-8 characters in a string of 14 bytes!
?>

Philo ¶

2 years ago

The other comments on this page were helpful for me.
However, consider using something better than empty($value) when checking the value returned by grapheme_extract since it could as well return something like "0" (which of course evaluates to false).
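
To make that concrete, here is a rough sketch of a loop that uses strict comparisons instead of empty(). The helper name grapheme_clusters() is hypothetical (it is not an intl function), and the code assumes the intl extension is loaded.

```php
<?php
// Hypothetical helper: splits a UTF-8 string into grapheme clusters.
function grapheme_clusters(string $text): array
{
    $clusters = [];
    $next = 0;
    $length = strlen($text);

    while ($next < $length) {
        // The current value of $next is used as the offset, then updated by reference.
        $cluster = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $next, $next);

        // Strict checks: only false or an empty string stop the loop.
        // A cluster such as "0" is kept, whereas empty("0") would be true.
        if ($cluster === false || $cluster === '') {
            break;
        }
        $clusters[] = $cluster;
    }

    return $clusters;
}

var_dump(empty("0"));                // bool(true): why empty() is unsafe here
print_r(grapheme_clusters("a0b"));   // the "0" is kept as its own element
```

Breaking out of the loop on false or an empty string also avoids spinning forever if a call fails without advancing the position.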

yevgen dot grytsay at gmail dot com ¶

5 years ago

Looping through grapheme clusters:

<?php

// Example taken from Rust documentation: https://doc.rust-lang.org/book/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my
$str = "नमस्ते";
// Alternatively:
//$str = pack('C*', ...[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]);
$next = 0;
$maxbytes = strlen($str);

var_dump($str);

while ($next < $maxbytes) {
    $char = grapheme_extract($str, 1, GRAPHEME_EXTR_COUNT, $next, $next);
    if (empty($char)) {
        continue;
    }
    echo "{$char} - This utf8 character is " . strlen($char) . ' bytes long', PHP_EOL;
}

//string(18) "नमस्ते"
//न - This utf8 character is 3 bytes long
//म - This utf8 character is 3 bytes long
//स् - This utf8 character is 6 bytes long
//ते - This utf8 character is 6 bytes long
?>
