PHP: Unicode character properties

As of PHP 5.1.0, three additional escape sequences for matching generic character types are available when UTF-8 mode is selected. They are:

\p{xx}: a character with the xx property
\P{xx}: a character without the xx property
\X: an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.
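The following is a minimal sketch of the three spellings described above, not an official example. The sample characters and variable names are assumptions chosen for illustration, and the `"\u{...}"` string escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical sample characters, written as escapes so the file encoding does not matter.
$upper = "\u{00C9}"; // "É", general category Lu
$lower = "\u{00E9}"; // "é", general category Ll

// The "u" modifier selects UTF-8 mode.
$upperIsLu     = preg_match('/^\p{Lu}$/u', $upper);  // has the Lu property
$lowerIsLu     = preg_match('/^\p{Lu}$/u', $lower);  // does not have it
$lowerNotLu    = preg_match('/^\P{Lu}$/u', $lower);  // \P negates the property
$lowerNotLuAlt = preg_match('/^\p{^Lu}$/u', $lower); // Perl-style negation, same meaning as \P{Lu}

var_dump($upperIsLu, $lowerIsLu, $lowerNotLu, $lowerNotLuAlt);
```

With these inputs, the second check is the only one that should fail to match.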

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:

\p{L}
\pL
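As a small illustration of that equivalence, the sketch below runs both spellings over the same string and compares the matches. The sample text and variable names are assumptions made for this example only.

```php
<?php
// Hypothetical sample text: "PHP 8 " followed by the Greek letters alpha and beta.
$text = "PHP 8 \u{03B1}\u{03B2}";

// Collect every single letter, once with braces and once without.
preg_match_all('/\p{L}/u', $text, $withBraces);
preg_match_all('/\pL/u', $text, $withoutBraces);

// Both patterns should find the same five letters; the digit and the spaces are skipped.
var_dump($withBraces[0] === $withoutBraces[0]);
var_dump(count($withBraces[0]));
```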

**Supported property codes**
Property	Matches	Notes
`C`	Other
`Cc`	Control
`Cf`	Format
`Cn`	Unassigned
`Co`	Private use
`Cs`	Surrogate
`L`	Letter	Includes the following properties: `Ll`, `Lm`, `Lo`, `Lt` and `Lu`.
`Ll`	Lower case letter
`Lm`	Modifier letter
`Lo`	Other letter
`Lt`	Title case letter
`Lu`	Upper case letter
`M`	Mark
`Mc`	Spacing mark
`Me`	Enclosing mark
`Mn`	Non-spacing mark
`N`	Number
`Nd`	Decimal number
`Nl`	Letter number
`No`	Other number
`P`	Punctuation
`Pc`	Connector punctuation
`Pd`	Dash punctuation
`Pe`	Close punctuation
`Pf`	Final punctuation
`Pi`	Initial punctuation
`Po`	Other punctuation
`Ps`	Open punctuation
`S`	Symbol
`Sc`	Currency symbol
`Sk`	Modifier symbol
`Sm`	Mathematical symbol
`So`	Other symbol
`Z`	Separator
`Zl`	Line separator
`Zp`	Paragraph separator
`Zs`	Space separator

Extended properties such as InMusicalSymbols are not supported by PCRE.

Specifying case-insensitive matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example:

\p{Greek}
\P{Han}

Characters that are not part of an identified script are lumped together as Common. The current list of scripts is:

**Supported scripts**
`Arabic`	`Armenian`	`Avestan`	`Balinese`	`Bamum`
`Batak`	`Bengali`	`Bopomofo`	`Brahmi`	`Braille`
`Buginese`	`Buhid`	`Canadian_Aboriginal`	`Carian`	`Chakma`
`Cham`	`Cherokee`	`Common`	`Coptic`	`Cuneiform`
`Cypriot`	`Cyrillic`	`Deseret`	`Devanagari`	`Egyptian_Hieroglyphs`
`Ethiopic`	`Georgian`	`Glagolitic`	`Gothic`	`Greek`
`Gujarati`	`Gurmukhi`	`Han`	`Hangul`	`Hanunoo`
`Hebrew`	`Hiragana`	`Imperial_Aramaic`	`Inherited`	`Inscriptional_Pahlavi`
`Inscriptional_Parthian`	`Javanese`	`Kaithi`	`Kannada`	`Katakana`
`Kayah_Li`	`Kharoshthi`	`Khmer`	`Lao`	`Latin`
`Lepcha`	`Limbu`	`Linear_B`	`Lisu`	`Lycian`
`Lydian`	`Malayalam`	`Mandaic`	`Meetei_Mayek`	`Meroitic_Cursive`
`Meroitic_Hieroglyphs`	`Miao`	`Mongolian`	`Myanmar`	`New_Tai_Lue`
`Nko`	`Ogham`	`Old_Italic`	`Old_Persian`	`Old_South_Arabian`
`Old_Turkic`	`Ol_Chiki`	`Oriya`	`Osmanya`	`Phags_Pa`
`Phoenician`	`Rejang`	`Runic`	`Samaritan`	`Saurashtra`
`Sharada`	`Shavian`	`Sinhala`	`Sora_Sompeng`	`Sundanese`
`Syloti_Nagri`	`Syriac`	`Tagalog`	`Tagbanwa`	`Tai_Le`
`Tai_Tham`	`Tai_Viet`	`Takri`	`Tamil`	`Telugu`
`Thaana`	`Thai`	`Tibetan`	`Tifinagh`	`Ugaritic`
`Vai`	`Yi`
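To make the script names above more concrete, here is a hedged sketch that pulls Greek and Han runs out of a mixed string. The sample string and variable names are assumptions for illustration, and the example presumes a PCRE build with Unicode property support.

```php
<?php
// Hypothetical mixed-script text: Latin "abc", three Greek letters, two Han characters.
$text = "abc \u{03B1}\u{03B2}\u{03B3} \u{6F22}\u{5B57}";

// Runs of characters that belong to a given script.
preg_match_all('/\p{Greek}+/u', $text, $greek);
preg_match_all('/\p{Han}+/u', $text, $han);

// \P{Han} matches any character that is NOT in the Han script.
preg_match_all('/\P{Han}+/u', $text, $nonHan);

// The ASCII space is not tied to a single script, so it is classified as Common.
$spaceIsCommon = preg_match('/^\p{Common}$/u', ' ');

var_dump($greek[0], $han[0], $nonHan[0], $spaceIsCommon);
```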

The \X escape matches a Unicode extended grapheme cluster. An extended grapheme cluster is one or more Unicode characters that combine to form a single glyph. In effect, it can be thought of as the Unicode equivalent of . because it matches one composed character, regardless of how many individual characters are actually used to represent it.
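A short sketch of the difference between . and \X, assuming a decomposed spelling of "café" in which the accent is a separate combining character. The string and variable names are illustrative assumptions.

```php
<?php
// "cafe" followed by U+0301 COMBINING ACUTE ACCENT: five code points, four visible characters.
$word = "cafe\u{0301}";

// "." in UTF-8 mode consumes one code point per match.
$codePoints = preg_match_all('/./su', $word);

// \X consumes a whole grapheme cluster, so "e" and its accent count once.
$clusters = preg_match_all('/\X/u', $word);

var_dump($codePoints, $clusters);
```

Counting with \X rather than . is one way to get a length that is closer to what a reader sees on screen.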

In versions of PCRE older than 8.32 (which correspond to PHP versions before 5.4.14 when using the bundled PCRE library), \X is equivalent to (?>\PM\pM*). That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.
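The sketch below spells out the older definition by hand and compares it with \X on a simple base-plus-marks sequence. The input is an assumption chosen so that both forms agree; on other input the two may well differ, since newer PCRE versions follow the fuller grapheme cluster rules.

```php
<?php
// A base letter followed by two combining marks (U+0308 and U+0301, both category Mn).
$cluster = "a\u{0308}\u{0301}";

// Hand-written form of the pre-8.32 definition: one non-mark, then any number of marks, atomically.
preg_match('/(?>\PM\pM*)/u', $cluster, $legacy);

// The built-in escape.
preg_match('/\X/u', $cluster, $modern);

// For this input both should consume the entire string as one unit.
var_dump($legacy[0] === $cluster, $modern[0] === $cluster);
```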

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

User Contributed Notes 10 notes

huhwatnouDONTspamPLEASE at hotmail dot com ¶

8 years ago

To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).

I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. " :-$ and then to find out how to do so.
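A minimal sketch of the effect described in this note, assuming PHP 7.0 or later for the `"\u{...}"` escape. The variable names are illustrative; the result without "u" is what the note reports, and it comes from the subject being processed byte by byte.

```php
<?php
// "ß" is U+00DF; in UTF-8 it is stored as the two bytes 0xC3 0x9F.
$sharpS = "\u{00DF}";

// Without "u" each byte is treated as its own character, so the byte 0x9F
// can be taken for the control code U+009F.
$withoutU = preg_match('/\p{Cc}/', $sharpS);

// With "u" the subject is decoded as UTF-8 and ß is a single lower case letter.
$withU   = preg_match('/\p{Cc}/u', $sharpS);
$isLower = preg_match('/^\p{Ll}$/u', $sharpS);

var_dump($withoutU, $withU, $isLower);
```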

Steve ¶

8 months ago

Examples are always useful! See https://unicodeplus.com/category for more.

C    Other     
Cc   Control      (Unicode code points in the ranges U+0000-U+001F and U+007F-U+009F)
Cf   Format       (Soft hyphen (U+00AD), zero width space (U+200B), etc.)
Cn   Unassigned   (Any code point that is not in the Unicode table)
Co   Private use     
Cs   Surrogate    (Characters in the range U+D800 to U+DFFF, which are invalid in UTF-8)

L    Letter
Ll   Lower case letter (a-z, µßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ and more)
Lm   Modifier letter   (Letter-like characters that are usually combined with others, but here they stand alone:
                        ʰʱʲʳʴʵʶʷʸʹʺʻʼʽʾʿˀˁˆˇˈˉˊˋˌˍˎˏːˑˠˡˢˣˤˬˮʹͺՙ and more)
Lo   Other letter      (ªºƻǀǁǂǃʔ and many more ideographs and letters from unicase alphabets)
Lt   Title case letter (ǅǈǋǲᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾼῌῼ)
Lu   Upper case letter (A-Z, ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ and more)
L&   Ordinary letter   (Any character that has the Lu, Ll, or Lt property)

M    Mark
Mc   Spacing mark      (None in Latin scripts)
Me   Enclosing mark    (Combining enclosing square (U+20DE) like in a⃞ , combining enclosing circle backslash (U+20E0) like in a⃠)
Mn   Non-spacing mark  (Combining diacritical marks U+0300-U+036f, like the accents on this letter a: áâãāa̅ăȧäảåa̋ǎa̍a̎ȁa̐ȃ)

N    Number      
Nd   Decimal number (0123456789, ٠١٢٣٤٥٦٧٨٩ and digits in many other scripts.)
Nl   Letter number  (ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿ and some more)
No   Other number   (⁰¹²³⁴⁵⁶⁷⁸⁹ ₀₁₂₃₄₅₆₇₈₉ ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒ ①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳, etc.)

P    Punctuation      
Pc   Connector punctuation (_ underscore (U+005F), ‿ undertie U+203F, ⁀ character tie (U+2040), etc.)
Pd   Dash punctuation      (- hyphen-minus (U+002D), ‐ hyphen (U+2010), ‑ non-breaking hyphen (U+2011), ‒ figure dash (U+2012),
                            – en dash (U+2013), — em dash (U+2014), ― horizontal bar (U+2015), etc.)
Pe   Close punctuation     (right parenthesis, bracket, or brace: `)` (U+0029), `]` (U+005D), `}` (U+007D), etc.) 
Pf   Final punctuation     (right quotation marks: » (U+00BB), ’ (U+2019), ” (U+201D), etc.)
Pi   Initial punctuation   (left  quotation marks: « (U+00AB), ‘ (U+2018), “ (U+201C), etc.)
Po   Other punctuation     (!"#%&'*,./:;?@\¡§¶·¿)
Ps   Open punctuation      (left parenthesis, bracket, or brace: `(` (U+0028), `[` (U+005B), `{` (U+007B), etc.) 

S    Symbol      
Sc   Currency symbol     ($¢£¤¥, ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ ₿ (U+20A0-U+20BF), etc.)
Sk   Modifier symbol     (Symbol-like characters that are usually combined with others, but here they stand alone:
                          ^`¨¯´¸ and more)
Sm   Mathematical symbol (+<=>|~¬±×÷϶ and many more)
So   Other symbol        (¦ broken bar (U+00A6), © copyright sign (U+00A9), ® registered sign (U+00AE), ° degree sign (U+00B0);
                          arrows, signs, emojis and many many more)

Z    Separator      
Zl   Line separator      (line separator (U+2028))
Zp   Paragraph separator (paragraph separator (U+2029))
Zs   Space separator     (space, no-break space, en quad, em quad, en space, em space, figure space, thin space, hair space, etc.)

mercury at caucasus dot net ¶

13 years ago

An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html

xuantoaiph at gmail dot com ¶

10 years ago

My country, Vietnam, has its own alphabet:
http://en.wikipedia.org/wiki/Vietnamese_alphabet
I hope PHP will support Vietnamese better.

o_shes01 at uni-muenster dot de ¶

13 years ago

For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter. 
For example, there are three codepoints for the "LJ" digraph in Unicode: 
  (*) uppercase "LJ": U+01C7 
  (*) titlecase "Lj": U+01C8 
  (*) lowercase "lj": U+01C9
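The three code points listed above can be checked against their categories with a short sketch like this one. The array layout and variable names are assumptions for illustration.

```php
<?php
// Expected general category => code point of the "LJ" digraph in that case form.
$forms = [
    'Lu' => "\u{01C7}", // uppercase "LJ"
    'Lt' => "\u{01C8}", // titlecase "Lj"
    'Ll' => "\u{01C9}", // lowercase "lj"
];

// Test each character against the category it is expected to have.
$results = [];
foreach ($forms as $category => $char) {
    $results[$category] = preg_match('/^\p{' . $category . '}$/u', $char);
}

var_dump($results);
```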

suit at rebell dot at ¶

14 years ago

These properties are usually available only if PCRE is compiled with "--enable-unicode-properties".

If you want to match any word but also want to provide a fallback, you can do something like this:

<?php
// preg_match_all() returns false when the pattern cannot be compiled,
// e.g. because \p is not supported by this PCRE build
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
  // fallback goes here
  // for example just '/\w+/u' for a less accurate match
}
?>

php at lnx-bsp dot net ¶

6 years ago

The explanation at the top of the page does not make this clear, but these escape sequences can be used inside square brackets to build a broader character class. For example:

<?php preg_match( '/[\p{N}\p{L}]+/u', $data ); ?>

This will match any combination of letters and numbers.

Yzmir Ramirez ¶

10 years ago

If you are working with older environments, you will first need to check whether the installed version of PCRE supports the Unicode escape sequences described above:

<?php

// Need to check PCRE version because some environments are
// running older versions of the PCRE library
// (run in *nix environment `pcretest -C`)

$allowInternational = false;
if (defined('PCRE_VERSION')) {
    if (intval(PCRE_VERSION) >= 7) { // constant available since PHP 5.2.4
        $allowInternational = true;
    }
}
?>

Now you can fall back to a simpler regex (e.g. "/[a-z]/i") when the PCRE library version is too old or not available.

phpnet at N_O_S_P_A_M dot osps dot net ¶

1 year ago

I found the predefined "supported" scripts helpful, except that there's no apparent definition of what Unicode character ranges are covered by those definitions. So I wrote this to determine them and print out the equivalent PCRE character class definitions. An example fragment of output is (I can't include all output due to PHP.net Note-posting limits)

Canadian_Aboriginal=[\x{1400}-\x{167f}\x{18b0}-\x{18f5}]

The program:

<?php

$scriptNames = array(
    'Arabic',
    'Armenian',
    'Avestan',
    'Balinese',
    'Bamum',
    'Batak',
    'Bengali',
    'Bopomofo',
    'Brahmi',
    'Braille',
    'Buginese',
    'Buhid',
    'Canadian_Aboriginal',
    'Carian',
    'Chakma',
    'Cham',
    'Cherokee',
    'Common',
    'Coptic',
    'Cuneiform',
    'Cypriot',
    'Cyrillic',
    'Deseret',
    'Devanagari',
    'Egyptian_Hieroglyphs',
    'Ethiopic',
    'Georgian',
    'Glagolitic',
    'Gothic',
    'Greek',
    'Gujarati',
    'Gurmukhi',
    'Han',
    'Hangul',
    'Hanunoo',
    'Hebrew',
    'Hiragana',
    'Imperial_Aramaic',
    'Inherited',
    'Inscriptional_Pahlavi',
    'Inscriptional_Parthian',
    'Javanese',
    'Kaithi',
    'Kannada',
    'Katakana',
    'Kayah_Li',
    'Kharoshthi',
    'Khmer',
    'Lao',
    'Latin',
    'Lepcha',
    'Limbu',
    'Linear_B',
    'Lisu',
    'Lycian',
    'Lydian',
    'Malayalam',
    'Mandaic',
    'Meetei_Mayek',
    'Meroitic_Cursive',
    'Meroitic_Hieroglyphs',
    'Miao',
    'Mongolian',
    'Myanmar',
    'New_Tai_Lue',
    'Nko',
    'Ogham',
    'Old_Italic',
    'Old_Persian',
    'Old_South_Arabian',
    'Old_Turkic',
    'Ol_Chiki',
    'Oriya',
    'Osmanya',
    'Phags_Pa',
    'Phoenician',
    'Rejang',
    'Runic',
    'Samaritan',
    'Saurashtra',
    'Sharada',
    'Shavian',
    'Sinhala',
    'Sora_Sompeng',
    'Sundanese',
    'Syloti_Nagri',
    'Syriac',
    'Tagalog',
    'Tagbanwa',
    'Tai_Le',
    'Tai_Tham',
    'Tai_Viet',
    'Takri',
    'Tamil',
    'Telugu',
    'Thaana',
    'Thai',
    'Tibetan',
    'Tifinagh',
    'Ugaritic',
    'Vai',
    'Yi'
);
$scriptTypes = array();
foreach( $scriptNames as $n ) $scriptTypes[ $n ] = array();
for( $i=0; $i <= 0x10ffff; $i++ ) { // cover the full Unicode range, U+0000 to U+10FFFF
    foreach( $scriptNames as $scriptName ) {

        if ( preg_match( '/[\p{'. $scriptName .'}]/u', mb_chr( $i, 'UTF-8') ) ) {

            if (empty( $scriptTypes[ $scriptName ])
                || ( ($i - $scriptTypes[ $scriptName ][ count( $scriptTypes[ $scriptName ] ) - 1 ][1]) > 1)
            ) {

                $scriptTypes[ $scriptName ][] = [$i, $i];

            } else {

                $scriptTypes[ $scriptName ][ count( $scriptTypes[ $scriptName ] ) - 1 ][1] = $i;
            }
        }
    }
}
foreach( $scriptTypes as $scriptName => $unicodeRanges ) {

    printf(
        '%s=[',
        $scriptName
    );
    foreach( $unicodeRanges as $r ) {

        printf(
            '\x{%04x}',
            $r[0]
        );
        if ($r[1] > $r[0] )
            printf(
                '-\x{%04x}',
                $r[1]
            );
    }
    printf(
        ']'.PHP_EOL
    );
}
