"line break" is ill-defined:
-- Windows uses CR+LF (\r\n)
-- Linux LF (\n)
-- OSX CR (\r)
Little-known special character:
\R in preg_* matches all three.
preg_match( '/^\R$/', "match\nany\\n\rline\r\nending\r" ); // match any line endings
反斜线有多种用法。首先,如果紧接着是一个非字母数字字符,表明取消 该字符所代表的特殊涵义。这种将反斜线作为转义字符的用法在字符类 内部和外部都可用。
比如,如果你希望匹配一个 "*" 字符,就需要在模式中写为 "\*"。 这适用于一个字符在不进行转义会有特殊含义的情况下。 但是, 对于非数字字母的字符,总是在需要其进行原文匹配的时候在它前面增加一个反斜线, 来声明它代表自己,这是安全的。如果要匹配一个反斜线,那么在模式中使用 ”\\”。
注意:
反斜线在单引号和双引号 PHP 字符串 中都有特殊含义。因此要匹配一个反斜线 \,正则表达式写法是 \\, 然后 PHP 代码中需要转义写成 "\\\\" 或 '\\\\'。
如果一个模式被使用 PCRE_EXTENDED 选项编译, 模式中的空白字符(除了字符类中的)和未转义的#到行末的所有字符都会被忽略。 要在这种情况下使用空白字符或者#,就需要对其进行转义。
反斜线的第二种用途提供了一种对非打印字符进行可见编码的控制手段。 除了二进制的 0 会终结一个模式外,并不会严格的限制非打印字符(自身)的出现, 但是当一个模式以文本编辑器的方式编辑准备的时候, 使用下面的转义序列相比使用二进制字符会更加容易:
\cx
的确切效果如下: 如果x
是一个小写字母,它被转换为大写。接着,
将字符的第6位(十六进制 40,右数第一个位为第0位)取反。
比如\cz
成为十六进制的1A,\c{
成为十六进制3B,
\c;
成为十六进制7B。
在”\x
”后面,读取两个十六进制数(字母可以是大写或小写)。
在UTF-8模式,
“\x{…}
”允许使用, 花括号内的内容是十六进制有效数字。
它将给出的十六进制数字解释为 UTF-8 字符代码。原来的十六进制转义序列,
\xhh
,
匹配一个双字节的UTF-8字符,如果它的值大于127
在”\0
”之后, 读取两个八进制数。所有情况下,如果数少于2个,则直接使用。
序列 ”\0\x\07
” 指定了两个二进制 0 紧跟着一个 BEL 字符。
请确保初始的 0 之后的两个数字是合法的八进制数。
处理一个反斜线紧跟着的不是0的数字的情况比较复杂。在字符类外部, PCRE 读取它并以十进制读取紧随其后的数字。 如果数值小于 10, 或者之前捕获到了该数字能够代表的左括号(子组), 整个数字序列被认为是后向引用。后向引用如何工作在后面描述, 接下来就会讨论括号子组。
在一个字符类里面,或者十进制数大于 9 并且没有那么多的子组被捕获, PCRE 重新读取反斜线后的第三个 8 进制数字,并且从最低的 8 位生成单字节值。 任何的后续数字都代表它们自身。例如:
注意,八进制值的 100 或者更大的值必须没有前置的0引导, 因为每次最多读取3个8进制位.
所有序列定义的单字节值都可以在字符类内部或外部使用。另外,在字符类中,
序列 ”\b
” 解释为退格字符(hex 08)。
字符类外它又有不同的意义(下面有描述)。
反斜线的第三种用法是用来描述特定的字符类:
上面每一对转义序列都代表了完整字符集中两个不相交的部分, 任意字符一定会匹配其中一个,同时一定不会匹配另外一个。
"空白字符"(whitespace)是 HT (9)、LF (10)、FF (12)、CR (13)、space (32)。 然而,若发生了本地化匹配,在代码点 128-255 范围内亦可能出现空白字符, 比如说 NBSP (A0)。
单词字符指的是任意字母、数字、下划线。
也就是说任意可以组成perl单词的字符。
字母和数字的定义通过PCRE字符表控制,可以通过指定地域设置使其匹配改变。比如,
在法国 (fr) 本地化设置中,一些超过 128 的字符代码被用于重音字母,
它们可以实用 \w
匹配。
这些字符类序列在字符类内部或外部都可以出现。 他们每次匹配所代表的字符类型中的一个字符。 如果当前匹配点位于目标字符串末尾, 它们中的所有字符都匹配失败, 因为没有字符让它们匹配了。
反斜线的第四种用法是一些简单的断言。 一个断言指定一个必须在特定位置匹配的条件, 它们不会从目标字符串中消耗任何字符。 接下来我们会讨论使用子组的更加复杂的断言。 反斜线断言包括:
这些断言不能出现在字符类中(但是注意, “\b
”在字符类中有不同的意义,
表示的是退格(backspace)字符)
一个单词边界表示的是在目标字符串中,
当前字符和前一个字符不同时匹配\w
或\W
(一个匹配\w
,
一个匹配\W
),
或者作为字符串开始或结尾字符的时候当前字符匹配\w
。
\A
, \Z
,
\z
断言不同于传统的^
和$
(详见 锚),
因为他们永远匹配目标字符串的开始和结尾,而不会受模式修饰符的限制。
它们不受PCRE_MULTILINE,PCRE_DOLLAR_ENDONLY选项的影响。
\Z
和 \z
之间的不同在于当字符串结束字符时换行符时 \Z
会将其看做字符串结尾匹配,
而 \z
只匹配字符串结尾。
\G
断言在指定了offset
参数的preg_match() 调用中,
仅在当前匹配位置在匹配开始点的时候才是成功的。
当 offset
的值不为 0 的时候,
它与 \A
是不同的。 译注:另外一点与 \A
的不同之处在于使用 preg_match_all() 时,
每次匹配 \G
只是断言是否是匹配结果的开始位置,
而 \A
断言的则是匹配结果的开始位置是否在目标字符串开始位置。
\Q
和 \E
可以用于在模式中忽略正则表达式元字符。比如:\w+\Q.$.\E$
会匹配一个或多个单词字符,紧接着 .$.
,最后锚向字符串末尾。注意这不会改变分隔符的行为;例如模式
#\Q#\E#$
无效,因为第二个 #
标记了模式的结束,而 \E#
解释为无效的修饰符。
\K
可以用于重置匹配。 比如,
foot\Kbar
匹配”footbar”。
但是得到的匹配结果是 ”bar”。但是, \K
的使用不会干预到子组内的内容,
比如 (foot)\Kbar
匹配 ”footbar”,第一个子组内的结果仍然会是 ”foo”。译注:
\K 放在子组和子组外面的效果是一样的。
"line break" is ill-defined:
-- Windows uses CR+LF (\r\n)
-- Linux LF (\n)
-- OSX CR (\r)
Little-known special character:
\R in preg_* matches all three.
preg_match( '/^\R$/', "match\nany\\n\rline\r\nending\r" ); // match any line endings
Significantly updated version (with new $pat4 utilising \R properly, its results and comments):
Note that there are (sometimes difficult to grasp at first glance) nuances of meaning and application of escape sequences like \r, \R and \v - none of them is perfect in all situations, but they are quite useful nevertheless. Some official PCRE control options and their changes come in handy too - unfortunately neither (*ANYCRLF), (*ANY) nor (*CRLF) is documented here on php.net at the moment (although they seem to be available for over 10 years and 5 months now), but they are described on Wikipedia ("Newline/linebreak options" at https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) and official PCRE library site ("Newline convention" at http://www.pcre.org/original/doc/html/pcresyntax.html#SEC17) pretty well. The functionality of \R appears somehow disappointing (with default configuration of compile time option) according to php.net as well as official description ("Newline sequences" at https://www.pcre.org/original/doc/html/pcrepattern.html#newlineseq) when used improperly.
A hint for those of you who are trying to fight off (or work around at least) the problem of matching a pattern correctly at the end ($) of any line in multiple lines mode (/m).
<?php
// Various OS-es have various end line (a.k.a line break) chars:
// - Windows uses CR+LF (\r\n);
// - Linux LF (\n);
// - OSX CR (\r).
// And that's why single dollar meta assertion ($) sometimes fails with multiline modifier (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _
$pat1='/\w$/mi'; // This works excellent in JavaScript (Firefox 7.0.1+)
$pat2='/\w\r?$/mi'; // Slightly better
$pat3='/\w\R?$/mi'; // Somehow disappointing according to php.net and pcre.org when used improperly
$pat4='/\w(?=\R)/i'; // Much better with allowed lookahead assertion (just to detect without capture) without multiline (/m) mode; note that with alternative for end of string ((?=\R|$)) it would grab all 7 elements as expected
$pat5='/\w\v?$/mi';
$pat6='/(*ANYCRLF)\w$/mi'; // Excellent but undocumented on php.net at the moment (described on pcre.org and en.wikipedia.org)
$n=preg_match_all($pat1, $str, $m1);
$o=preg_match_all($pat2, $str, $m2);
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat4, $str, $m4);
$s=preg_match_all($pat5, $str, $m5);
$t=preg_match_all($pat6, $str, $m6);
echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
."\n4 !!! $pat4 ($r): ".print_r($m4[0], true)
."\n5 !!! $pat5 ($s): ".print_r($m5[0], true)
."\n6 !!! $pat6 ($t): ".print_r($m6[0], true);
// Note the difference among the three very helpful escape sequences in $pat2 (\r), $pat3 and $pat4 (\R), $pat5 (\v) and altered newline option in $pat6 ((*ANYCRLF)) - for some applications at least.
/* The code above results in the following output:
ABC ABC
123 123
def def
nop nop
890 890
QRS QRS
~-_ ~-_
1 !!! /\w$/mi (3): Array
(
[0] => C
[1] => 0
[2] => _
)
2 !!! /\w\r?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
3 !!! /\w\R?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
4 !!! /\w(?=\R)/i (6): Array
(
[0] => C
[1] => 3
[2] => f
[3] => p
[4] => 0
[5] => S
)
5 !!! /\w\v?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
6 !!! /(*ANYCRLF)\w$/mi (7): Array
(
[0] => C
[1] => 3
[2] => f
[3] => p
[4] => 0
[5] => S
[6] => _
)
*/
?>
Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.
As \v matches both single char line ends (CR, LF) and double char (CR+LF, LF+CR), it is not a fixed length atom (eg. is not allowed in lookbehind assertions).
A non breaking space is not considered as a space and cannot be caught by \s.
it can be found with :
- [\xc2\xa0] in utf-8
- \x{00a0} in unicode
Note that there are (sometimes difficult to grasp at first glance) nuances of meaning and application of escape sequences like \r, \R and \v - none of them is perfect in all situations, but they are quite useful nevertheless. Some official PCRE control options and their changes come in handy too - unfortunately neither (*ANYCRLF), (*ANY) nor (*CRLF) is documented here on php.net at the moment (although they seem to be available for over 10 years and 5 months now), but they are described on Wikipedia ("Newline/linebreak options" at https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) and official PCRE library site ("Newline convention" at http://www.pcre.org/original/doc/html/pcresyntax.html#SEC17) pretty well. The functionality of \R appears somehow disappointing (with default configuration of compile time option) according to php.net as well as official description ("Newline sequences" at https://www.pcre.org/original/doc/html/pcrepattern.html#newlineseq).
A hint for those of you who are trying to fight off (or work around at least) the problem of matching a pattern correctly at the end ($) of any line in multiple lines mode (/m).
<?php
// Various OS-es have various end line (a.k.a line break) chars:
// - Windows uses CR+LF (\r\n);
// - Linux LF (\n);
// - OSX CR (\r).
// And that's why single dollar meta assertion ($) sometimes fails with multiline modifier (/m) mode - possible bug in PHP 5.3.8 or just a "feature"(?).
$str="ABC ABC\n\n123 123\r\ndef def\rnop nop\r\n890 890\nQRS QRS\r\r~-_ ~-_";
// C 3 p 0 _
$pat1='/\w$/mi'; // This works excellent in JavaScript (Firefox 7.0.1+)
$pat2='/\w\r?$/mi';
$pat3='/\w\R?$/mi'; // Somehow disappointing according to php.net and pcre.org
$pat4='/\w\v?$/mi';
$pat5='/(*ANYCRLF)\w$/mi'; // Excellent but undocumented on php.net at the moment
$n=preg_match_all($pat1, $str, $m1);
$o=preg_match_all($pat2, $str, $m2);
$p=preg_match_all($pat3, $str, $m3);
$r=preg_match_all($pat4, $str, $m4);
$s=preg_match_all($pat5, $str, $m5);
echo $str."\n1 !!! $pat1 ($n): ".print_r($m1[0], true)
."\n2 !!! $pat2 ($o): ".print_r($m2[0], true)
."\n3 !!! $pat3 ($p): ".print_r($m3[0], true)
."\n4 !!! $pat4 ($r): ".print_r($m4[0], true)
."\n5 !!! $pat5 ($s): ".print_r($m5[0], true);
// Note the difference among the three very helpful escape sequences in $pat2 (\r), $pat3 (\R), $pat4 (\v) and altered newline option in $pat5 ((*ANYCRLF)) - for some applications at least.
/* The code above results in the following output:
ABC ABC
123 123
def def
nop nop
890 890
QRS QRS
~-_ ~-_
1 !!! /\w$/mi (3): Array
(
[0] => C
[1] => 0
[2] => _
)
2 !!! /\w\r?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
3 !!! /\w\R?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
4 !!! /\w\v?$/mi (5): Array
(
[0] => C
[1] => 3
[2] => p
[3] => 0
[4] => _
)
5 !!! /(*ANYCRLF)\w$/mi (7): Array
(
[0] => C
[1] => 3
[2] => f
[3] => p
[4] => 0
[5] => S
[6] => _
)
*/
?>
Unfortunately, I haven't got any access to a server with the latest PHP version - my local PHP is 5.3.8 and my public host's PHP is version 5.2.17.