I have written a short introduction and a colorful cheat sheet for Perl Compatible Regular Expressions (PCRE):
http://www.bitcetera.com/en/techblog/2008/04/01/regex-in-a-nutshell/
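Perl-Compatible Regular Expression Functions
Introduction
The pattern syntax used by these functions closely resembles Perl's. The expression must be enclosed in delimiters, for example a forward slash (/). Any character that is not a letter, a digit, or a backslash (\) can serve as the delimiter. If the delimiter character has to appear in the expression itself, it must be escaped with a backslash. Since PHP 4.0.4, the Perl-style (), {}, [] and <> matching delimiters can also be used. See Pattern Syntax for a detailed explanation.
The ending delimiter may be followed by various modifiers that affect the matching. See Pattern Modifiers.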
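As a rough illustration of the delimiter rules above, the sketch below runs the same expression with three delimiter choices. The sample URL and the variable names are made up for this example.

```php
<?php
// Hypothetical subject string, chosen only for illustration.
$url = 'http://www.pcre.org/pcre.txt';

// Slash delimiters: every literal slash in the expression must be escaped.
preg_match('/^http:\/\/([^\/]+)\//', $url, $withSlash);

// Hash delimiters: the slashes no longer need escaping.
preg_match('#^http://([^/]+)/#', $url, $withHash);

// Perl-style matching delimiters, followed by the i (case-insensitive) modifier.
preg_match('{^HTTP://([^/]+)/}i', $url, $withBraces);

// All three capture the host name in subpattern 1.
echo $withSlash[1], "\n", $withHash[1], "\n", $withBraces[1], "\n";
```

Picking a delimiter that does not occur in the expression usually keeps the pattern easier to read than escaping every occurrence.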
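PHP also supports regular expressions using the POSIX extended syntax; see the POSIX Extended Regular Expression Functions.
Note: This extension maintains a global per-thread cache of compiled regular expressions (up to 4096).
You should be aware of some limitations of PCRE. For more information, see http://www.pcre.org/pcre.txt.
Requirements
No external libraries are needed to build this extension.
Installation
Since PHP 4.2.0 these functions are enabled by default. The PCRE functions can be disabled with --without-pcre-regex. If you are not using the bundled library, use --with-pcre-regex=DIR to specify the directory containing the PCRE library and header files. For earlier versions, PHP has to be configured and compiled with --with-pcre-regex[=DIR] in order to use these functions.
The Windows version of PHP has built-in support for this extension. You do not need to load any additional extensions to use these functions.
Runtime Configuration
This extension has no configuration directives defined in php.ini.
Resource Types
This extension defines no resource types.
Predefined Constants
The constants below are defined by this extension, and are only available when the extension has either been compiled into PHP or dynamically loaded at runtime.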
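| Constant | Description |
|---|---|
| PREG_PATTERN_ORDER | Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on. This flag is only used with preg_match_all(). |
| PREG_SET_ORDER | Orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on. This flag is only used with preg_match_all(). |
| PREG_OFFSET_CAPTURE | See the description of PREG_SPLIT_OFFSET_CAPTURE. Available since PHP 4.3.0. |
| PREG_SPLIT_NO_EMPTY | This flag makes preg_split() return only non-empty pieces. |
| PREG_SPLIT_DELIM_CAPTURE | This flag makes preg_split() also capture parenthesized expressions in the delimiter pattern. Available since PHP 4.0.5. |
| PREG_SPLIT_OFFSET_CAPTURE | If this flag is set, the offset of each match within the subject string is returned along with the match. Note that this changes the return value: every element becomes an array whose first item is the matched string and whose second item is its offset. Available since PHP 4.3.0 and only used with preg_split(). |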
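The ordering and split flags in the table can be seen side by side in a small sketch. The subject strings and variable names below are invented for illustration.

```php
<?php
$subject = '2008-04 and 2009-02';
$pattern = '/(\d{4})-(\d{2})/';

// One array per subpattern: [0] full matches, [1] years, [2] months.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);

// One array per match: [0] is everything about the first match, [1] the second.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);

// Split on runs of commas and whitespace, drop empty pieces,
// and return each piece together with its byte offset.
$pieces = preg_split(
    '/[\s,]+/',
    'a, b,,c,',
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE
);

print_r($byPattern);
print_r($bySet);
print_r($pieces);
```

PREG_SET_ORDER tends to be the more convenient choice when each match is processed as a unit, while PREG_PATTERN_ORDER suits code that only needs one subpattern across all matches.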
Examples
Example #1 Examples of valid patterns
- /<\/\w+>/
- |(\d{3})-\d+|Sm
- /^(?i)php[34]/
- {^\s+(\s+)?$}
Example #2 Examples of invalid patterns
- /href='(.*)' - missing ending delimiter
- /\w+\s*\w+/J - unknown modifier 'J'
- 1-\d3-\d3-\d4| - missing starting delimiter
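One way to check patterns like these at runtime is to let preg_match() try to compile them. The helper name is_valid_pattern() is hypothetical, and the sketch assumes that preg_match() returns false (with a warning) when a pattern cannot be compiled.

```php
<?php
// Hypothetical helper: true if PCRE accepts the pattern.
function is_valid_pattern(string $pattern): bool
{
    // A pattern that cannot be compiled makes preg_match() return false
    // and raise a warning, which is silenced here with @.
    return @preg_match($pattern, '') !== false;
}

$patterns = [
    '/<\/\w+>/',        // valid
    '/^(?i)php[34]/',   // valid
    '{^\s+(\s+)?$}',    // valid
    "/href='(.*)'",     // missing ending delimiter
    '1-\d3-\d3-\d4|',   // missing starting delimiter
];

foreach ($patterns as $pattern) {
    printf("%-20s %s\n", $pattern, is_valid_pattern($pattern) ? 'valid' : 'invalid');
}
```

Which modifiers count as unknown depends on the PHP version, so the modifier example is left out of this check.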
Table of Contents
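- Pattern Modifiers - Describes the modifiers available in regular expression patterns
- Pattern Syntax - Describes the syntax of Perl-compatible regular expressions
- preg_grep - Return array entries that match the pattern
- preg_last_error - Returns the error code of the last PCRE regex execution
- preg_match_all - Perform a global regular expression match
- preg_match - Perform a regular expression match
- preg_quote - Quote regular expression characters
- preg_replace_callback - Perform a regular expression search and replace using a callback
- preg_replace - Perform a regular expression search and replace
- preg_split - Split string by a regular expression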
PCRE
10-Feb-2009 01:43
13-Sep-2007 08:42
One comment about 5.2.x and the pcre.backtrack_limit:
Note that this setting wasn't present under previous PHP releases, and the behaviour (or limit) under those releases was, in practice, higher, so all these PCRE functions were able to "capture" longer strings.
With the arrival of the setting, defaulting to 100000 (less than 100K), you won't be able to match/capture strings over that size using, for example "ungreedy" modifiers.
So, in a lot of situations, you'll need to raise that (very small IMO) limit.
The worst part is that PHP simply won't match/capture strings over pcre.backtrack_limit, and it will be 100% silent about it (I think raising a NOTICE/WARNING when the limit is hit would help developers a lot).
Judging from what I've read on forums, bug reports and so on, a lot of people are suffering from this changed behaviour.
Hope this note helps, ciao :-)
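Since the failure is silent, one way to notice it is to ask preg_last_error() after the call. The sketch below is only an illustration: it lowers pcre.backtrack_limit and switches off pcre.jit (both choices are assumptions made to reproduce the problem quickly), then runs a pattern that backtracks heavily on a subject it cannot match.

```php
<?php
// Assumption for the demo: a very small limit, and no JIT so the
// backtracking limit is what stops the match.
ini_set('pcre.backtrack_limit', '1000');
ini_set('pcre.jit', '0');

// The repeated group can split the subject in many ways, and the
// trailing [!?] never matches, so the engine keeps backtracking.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');
$error  = preg_last_error();

if ($error === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit was exhausted\n";
} elseif ($error !== PREG_NO_ERROR) {
    echo "Another PCRE error occurred (code $error)\n";
} else {
    echo "No error, preg_match() returned ";
    var_dump($result);
}
```

Checking the error code after calls on large inputs makes it possible to tell "no match" apart from "gave up", and to decide whether raising the limit with ini_set() is appropriate.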
05-May-2007 11:16
PCRE faster than POSIX RE? Not always.
In a recent search-engine project here at Cynergi, I had a simple loop with a few cute ereg_replace() functions that took 3min to process data. I changed that 10-line loop into 100 lines of hand-written replacement code and the loop now took 10s to process the same data! This opened my eyes to what can *IN SOME CASES* be very slow regular expressions.
Lately I decided to look into Perl-compatible regular expressions (PCRE). Most pages claim PCRE are faster than POSIX, but a few claim otherwise. I decided on benchmarks of my own.
My first few tests confirmed PCRE to be faster, but... the results were slightly different than others were getting, so I decided to benchmark every case of RE usage I had on an 8000-line secure (and fast) Webmail project here at Cynergi to check it out.
The results? Inconclusive! Sometimes PCRE *are* faster (sometimes by a factor greater than 100x!), but some other times POSIX RE are faster (by a factor of 2x).
I have yet to find a rule for when one or the other is faster. It's not only about search data size, amount of data matched, or "RE compilation time", which would show when you repeat the function often: if it were, one would *always* be faster than the other. But I didn't find a pattern here. Truth be told, I also didn't take the time to look into the source code and analyse the problem.
I can give you some examples, though. The POSIX RE
([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+
([0-9]{2}):([0-9]{2}):([0-9]{2})
is 30% faster in POSIX than when converted to PCRE (even if you use \d and \D and non-greedy matching). On the other hand, a similarly complex PCRE pattern
/[0-9]{1,2}[ \t]+[a-zA-Z]{3}[ \t]+[0-9]{4}[ \t]+[0-9]{1,2}:[0-9]{1,2}(:[0-9]{1,2})?[ \t]+[+-][0-9]{4}/
is 2.5x faster in PCRE than in POSIX RE. Simple replacement patterns like
ereg_replace( "[^a-zA-Z0-9-]+", "", $m );
are 2x faster in POSIX RE than PCRE. And then we get confused again because a POSIX RE pattern like
(^|\n|\r)begin-base64[ \t]+[0-7]{3,4}[ \t]+......
is 2x faster as POSIX RE, but the case-insensitive PCRE
/^Received[ \t]*:[ \t]*by[ \t]+([^ \t]+)[ \t]/i
is 30x faster than its POSIX RE version!
When it comes to case sensitivity, PCRE has so far seemed to be the best option. But I found some really strange behaviour from ereg/eregi. On a very simple POSIX RE
(^|\r|\n)mime-version[ \t]*:
I found eregi() taking 3.60s (just a number in a test benchmark), while the corresponding PCRE took 0.16s! But if I used ereg() (case-sensitive) the POSIX RE time went down to 0.08s! So I investigated further. I tried to make the POSIX RE case-insensitive itself. I got as far as this:
(^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][ \t]*:
This version also took 0.08s. But if I try to apply the same rule to any of the 'v', 'e', 'r' or 's' letters that are not changed, the time is back to the 3.60s mark, and not gradually, but immediately so! The test data didn't have any "vers" in it, other "mime" words in it or any "ion" that might be confusing the POSIX parser, so I'm at a loss.
Bottom line: always benchmark your PCRE / POSIX RE to find the fastest!
Tests were performed with PHP 5.1.2 under Windows, from the command line.
Pedro Freire
cynergi.com
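A minimal timing harness for this kind of comparison could look like the sketch below. The helper time_pattern(), the sample subject and the iteration count are all made up for illustration. Because the ereg functions are no longer available in PHP 7 and later, the sketch compares two PCRE spellings of the date/time pattern from the note above (joined into one expression, which is an assumption about how the two lines fit together).

```php
<?php
// Hypothetical helper: seconds needed to run $pattern against $subject
// $iterations times.
function time_pattern(string $pattern, string $subject, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;
}

// Sample data invented for this sketch.
$subject = 'Logged at 2007/05/05 -- 11:16:00';

$candidates = [
    'bracket classes'   => '#([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+([0-9]{2}):([0-9]{2}):([0-9]{2})#',
    'shorthand classes' => '#(\d{4})/(\d{2})/(\d{2})\D+(\d{2}):(\d{2}):(\d{2})#',
];

foreach ($candidates as $label => $pattern) {
    printf("%-18s %.4f s\n", $label, time_pattern($pattern, $subject));
}
```

Timings from a loop like this vary with the PHP version, the PCRE build and the data, so it is worth running it on representative input rather than on a single short string.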
22-Sep-2005 06:50
There's a printable PDF PCRE cheat sheet available here:
http://www.phpguru.org/article.php?ne_id=67
It covers the common metacharacters, quantifiers, pattern modifiers, character classes and assertions, with short explanations.
24-Oct-2004 01:08
If you want to perform regular expressions on Unicode strings, the PCRE functions will NOT be of any help unless the pattern carries the u modifier, which makes PCRE treat pattern and subject as UTF-8. For other encodings you need to use the Multibyte extension: mb_ereg(), mb_eregi(), mb_ereg_replace() and so on. When doing so, be careful to set the default text encoding to the same encoding used by the text you are searching and replacing in. You can do that with the mb_regex_encoding() function. You will probably also want to set the default encoding for the other mb_* string functions with mb_internal_encoding().
So when dealing with, say, French text, I start with these:
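<?php
// Encoding used by the mb_* string functions.
mb_internal_encoding('UTF-8');
// Encoding used by the mb_ereg* regex functions.
mb_regex_encoding('UTF-8');
// Locale names are platform-dependent; setlocale() tries each one in turn.
setlocale(LC_ALL, 'fr_FR.UTF-8', 'fr_FR', 'fr-fr');
?>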
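For text that is already UTF-8, the PCRE u modifier is a related option worth knowing about. The sketch below only shows the difference it makes when counting characters; the sample word is an arbitrary choice, and the "\u{...}" escapes require PHP 7 or later.

```php
<?php
// "été" written with escapes so the example does not depend on the
// encoding of the source file.
$word = "\u{E9}t\u{E9}";

// Without u, the dot matches one byte at a time.
$byteCount = preg_match_all('/./', $word);

// With u, pattern and subject are treated as UTF-8 and the dot matches
// one character at a time.
$charCount = preg_match_all('/./u', $word);

printf("without u: %d, with u: %d\n", $byteCount, $charCount);
```

The two accented letters take two bytes each in UTF-8, which is why the counts differ.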
20-Jul-2004 12:17
Something to bear in mind is that regex is actually a declarative programming language like Prolog: your regex is a set of rules which the regex interpreter tries to match against a string. During this matching, the interpreter will assume certain things, and continue assuming them until it comes up against a failure to match, which then causes it to backtrack. Regex assumes "greedy matching" unless explicitly told not to, which can cause a lot of backtracking. A general rule of thumb is that the more backtracking, the slower the matching process.
It is therefore vital, if you are trying to optimise your program to run quickly (and if you can't do without regex), to optimise your regexes to match quickly.
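A small sketch of what greedy matching means in practice, using an invented HTML fragment: the same capture written greedily, lazily, and with a negated character class that leaves the engine very little to backtrack over.

```php
<?php
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* first runs to the end of the string, then backs up
// until the final </b> fits.
preg_match('#<b>(.*)</b>#', $html, $greedy);

// Lazy: .*? grows one character at a time and stops at the first </b>.
preg_match('#<b>(.*?)</b>#', $html, $lazy);

// Negated class: [^<]* cannot run past the next tag at all.
preg_match('#<b>([^<]*)</b>#', $html, $negated);

echo $greedy[1], "\n";   // one</b> and <b>two
echo $lazy[1], "\n";     // one
echo $negated[1], "\n";  // one
```

When the content cannot contain the terminating character, the negated class is often the cheapest of the three, since it states the rule directly instead of relying on backtracking.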
I recommend the use of a tool such as "The Regex Coach" to debug your regex strings.
http://weitz.de/files/regex-coach.exe (Windows installer) http://weitz.de/files/regex-coach.tgz (Linux tar archive)
21-Sep-2003 04:00
Regular expression tutorials from non-PHP sites:
http://www.amk.ca/python/howto/regex/
http://sitescooper.org/tao_regexps.html
http://www.english.uga.edu/humcomp/perl/regex2a.html
http://www.english.uga.edu/humcomp/perl/regexps.html
http://www.english.uga.edu/humcomp/perl/regular_expressions.HTML
http://www.english.uga.edu/humcomp/perl/
http://java.sun.com/docs/books/tutorial/extra/regex/
http://gnosis.cx/publish/programming/regular_expressions.html
http://www.zvon.org/other/PerlTutorial/Books/Book1/
http://it.metr.ou.edu/regex/
http://www.regular-expressions.info/
