In answer to "me at codingbear.com":
You can achieve the same result in a way that is, in my opinion, easier.
However, if you already have a custom error handler set, $php_errormsg won't be populated and you will have to retrieve the last error from your custom error handler.
<?php
/**
 * The RegEx validation class.
 *
 * Usage:
 * if (!RegEx::isValid($expression)) {
 *     echo 'Your regular expression is invalid because: ' . RegEx::error();
 * }
 */
class RegEx {
    /**
     * Validates a regular expression. Returns TRUE
     * if the expression is valid, FALSE if not. If
     * the expression is not valid, the reason why
     * can be fetched from RegEx::error().
     *
     * @access public
     * @static
     * @param string $regex Regular Expression
     * @return bool
     */
    public static function isValid($regex)
    {
        // Reset the error flag left by any previous check.
        self::error(FALSE);
        // preg_match() returns FALSE for an invalid pattern. "@" hides the
        // warning; $php_errormsg is only set when track_errors is enabled.
        $nbMatch = @preg_match($regex, '');
        if ($nbMatch === FALSE) {
            // Cut off the 'preg_match(): ' part of the error message.
            self::error(substr($php_errormsg, 14));
            return FALSE;
        }
        return TRUE;
    }

    /**
     * Holds the error from the last validation check.
     *
     * The first parameter is used internally and should
     * not be used by the developer.
     *
     * @access public
     * @static
     * @param FALSE|string $value value to set for $flag
     * @return FALSE|string
     */
    public static function error($value = NULL)
    {
        static $flag = FALSE;
        if (!is_null($value)) {
            $flag = $value;
        }
        return $flag;
    }
}
?>
CXIX. Regular Expression Functions (Perl-Compatible)
Introduction
The syntax for patterns used in these functions closely resembles Perl. The expression should be enclosed in delimiters, for example a forward slash (/). Any character can be used as a delimiter as long as it is not alphanumeric or a backslash (\). If the delimiter character has to be used in the expression itself, it needs to be escaped with a backslash. Since PHP 4.0.4, you can also use Perl-style (), {}, [], and <> matching delimiters. See Pattern Syntax for a detailed explanation.
The ending delimiter may be followed by various modifiers that affect the matching. See Pattern Modifiers.
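As a small illustration of the delimiter rules above, the sketch below writes the same match three ways. The subject string and variable names are made up for the example.

```php
<?php
// Made-up subject for illustration.
$subject = 'Interpreter: /usr/bin/php';

// With "/" as the delimiter, every literal slash must be escaped.
$withSlashes = preg_match('/\/usr\/bin\/(\w+)/', $subject, $a);

// With "#" as the delimiter, the slashes need no escaping.
$withHashes = preg_match('#/usr/bin/(\w+)#', $subject, $b);

// Perl-style matching brace delimiters, followed by the "i" modifier
// so that the upper-case pattern still matches the lower-case path.
$withBraces = preg_match('{/USR/BIN/(\w+)}i', $subject, $c);

// All three capture the same program name.
var_dump($a[1] === $b[1] && $b[1] === $c[1]);
```

Picking a delimiter that does not occur in the pattern usually keeps the expression easier to read than escaping each occurrence.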
PHP also supports regular expressions using a POSIX-extended syntax using the POSIX-extended regex functions.
Note: This extension maintains a global per-thread cache of compiled regular expressions (up to 4096).
You should be aware of some limitations of PCRE. Read http://www.pcre.org/pcre.txt for more info.
Requirements
These functions are available as part of the standard module, which is always available.
Installation
Beginning with PHP 4.2.0 these functions are enabled by default. You can disable the pcre functions with --without-pcre-regex. Use --with-pcre-regex=DIR to specify DIR where PCRE's include and library files are located, if not using bundled library. For older versions you have to configure and compile PHP with --with-pcre-regex[=DIR] in order to use these functions.
The Windows version of PHP has built-in support for this extension. You do not need to load any additional extensions in order to use these functions.
Runtime Configuration
The behaviour of these functions is affected by settings in php.ini.
Table 218. PCRE Configuration Options
| Name | Default | Changeable | Changelog |
|---|---|---|---|
| pcre.backtrack_limit | 100000 | PHP_INI_ALL | Available since PHP 5.2.0. |
| pcre.recursion_limit | 100000 | PHP_INI_ALL | Available since PHP 5.2.0. |
For further details and definitions of the PHP_INI_* constants, see the documentation for ini_set().
Here is a short explanation of the configuration directives.
- pcre.backtrack_limit integer
PCRE's backtracking limit.
- pcre.recursion_limit integer
PCRE's recursion limit. Please note that if you set this value to a high number you may consume all the available process stack and eventually crash PHP (due to reaching the stack size limit imposed by the Operating System).
Resource Types
This extension has no resource types defined.
Predefined Constants
The constants below are defined by this extension, and will only be available when the extension has either been compiled into PHP or dynamically loaded at runtime.
Table 219. PREG constants
| constant | description |
|---|---|
| PREG_PATTERN_ORDER | Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on. This flag is only used with preg_match_all(). |
| PREG_SET_ORDER | Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on. This flag is only used with preg_match_all(). |
| PREG_OFFSET_CAPTURE | See the description of PREG_SPLIT_OFFSET_CAPTURE. This flag is available since PHP 4.3.0. |
| PREG_SPLIT_NO_EMPTY | This flag tells preg_split() to return only non-empty pieces. |
| PREG_SPLIT_DELIM_CAPTURE | This flag tells preg_split() to capture parenthesized expression in the delimiter pattern as well. This flag is available since PHP 4.0.5. |
| PREG_SPLIT_OFFSET_CAPTURE | If this flag is set, the corresponding string offset is also returned for every match. Note that this changes the return value into an array where every element is itself an array consisting of the matched string at index 0 and its string offset within subject at index 1. This flag is available since PHP 4.3.0 and is only used for preg_split(). |
| PREG_NO_ERROR | Returned by preg_last_error() if there were no errors. Available since PHP 5.2.0. |
| PREG_INTERNAL_ERROR | Returned by preg_last_error() if there was an internal PCRE error. Available since PHP 5.2.0. |
| PREG_BACKTRACK_LIMIT_ERROR | Returned by preg_last_error() if backtrack limit was exhausted. Available since PHP 5.2.0. |
| PREG_RECURSION_LIMIT_ERROR | Returned by preg_last_error() if recursion limit was exhausted. Available since PHP 5.2.0. |
| PREG_BAD_UTF8_ERROR | Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). Available since PHP 5.2.0. |
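To make the ordering and split flags in the table more concrete, here is a minimal sketch. The subject strings and variable names are invented for the example.

```php
<?php
$subject = 'a=1, b=2';
$pattern = '/(\w)=(\d)/';

// PREG_PATTERN_ORDER: index 0 holds all full matches, index 1 all
// strings matched by the first subpattern, and so on.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: index 0 holds the first match together with its
// subpatterns, index 1 the second match, and so on.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);

// PREG_SPLIT_NO_EMPTY drops the empty piece between the two adjacent
// commas; PREG_SPLIT_OFFSET_CAPTURE pairs each piece with its offset.
$parts = preg_split('/,\s*/', 'x, y,,z', -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

print_r($byPattern);
print_r($bySet);
print_r($parts);
```

The two orderings contain the same strings; only the nesting differs, so the choice mostly depends on whether you want to loop over matches or over subpatterns.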
Examples
Example 1440. Examples of invalid patterns
- /href='(.*)' - missing ending delimiter
- /\w+\s*\w+/J - unknown modifier 'J'
- 1-\d3-\d3-\d4| - missing starting delimiter
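The following sketch shows one way to see what PHP reports for patterns like these. `patternError()` is a hypothetical helper written for this example, not a PCRE function. It assumes no custom error handler is installed and relies on error_get_last(), which needs PHP 5.2.0 or later. The `J` example is left out because, as far as I know, newer PHP versions accept that modifier.

```php
<?php
/**
 * Hypothetical helper: returns NULL when $pattern compiles, otherwise
 * the warning text produced by preg_match().
 */
function patternError($pattern)
{
    // Try the pattern on an empty subject. "@" keeps the warning out of
    // the output, but error_get_last() still records it.
    if (@preg_match($pattern, '') !== false) {
        return null;
    }
    $last = error_get_last();
    return $last !== null ? $last['message'] : 'unknown error';
}

$patterns = array(
    "/href='(.*)'",    // missing ending delimiter
    '1-\d3-\d3-\d4|',  // starting delimiter is alphanumeric
    "/href='(.*)'/",   // valid once the ending delimiter is added
);

foreach ($patterns as $p) {
    $error = patternError($p);
    echo $p, ' => ', ($error === null ? 'valid' : $error), "\n";
}
```

The exact wording of the messages differs between PHP versions, so it is safer to test for NULL than to compare message text.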
Table of Contents
- Pattern Modifiers — Describes possible modifiers in regex patterns
- Pattern Syntax — Describes PCRE regex syntax
- preg_grep — Return array entries that match the pattern
- preg_last_error — Returns the error code of the last PCRE regex execution
- preg_match_all — Perform a global regular expression match
- preg_match — Perform a regular expression match
- preg_quote — Quote regular expression characters
- preg_replace_callback — Perform a regular expression search and replace using a callback
- preg_replace — Perform a regular expression search and replace
- preg_split — Split string by a regular expression
Regular Expression Functions (Perl-Compatible)
09-Apr-2008 07:39
04-Jan-2008 07:58
While developing an application I needed to take user-entered regular expressions and validate them before use. Since I was unable to find a built-in function for doing so, I wrote this piece of code.
It works by setting an error handler then using the provided regular expression on an empty string. If the regular expression is invalid then preg_match will create an error which is captured by the error handler. The error handler is then restored.
It is possible to generalize this for any function by modifying isValid() to accept a function/method name and the appropriate parameters instead.
For aesthetics, the preg_match(): part of the error is stripped out.
<?php
/**
 * The RegEx validation class.
 *
 * Usage:
 * if (!RegEx::isValid($expression)) {
 *     echo 'Your regular expression is invalid because: ' . RegEx::error();
 * }
 */
class RegEx {
    /**
     * Validates a regular expression. Returns TRUE
     * if the expression is valid, FALSE if not. If
     * the expression is not valid, the reason why
     * can be fetched from RegEx::error().
     *
     * @access public
     * @static
     * @param string $regex Regular Expression
     * @return bool
     */
    public static function isValid($regex)
    {
        // Reset the error flag, then let the temporary handler record
        // any warning raised while the pattern is compiled.
        RegEx::error(FALSE);
        set_error_handler(array('RegEx', 'errorHandler'));
        preg_match($regex, '');
        restore_error_handler();
        return RegEx::error() === FALSE;
    }

    /**
     * Error handler for RegEx. Used internally by RegEx::isValid()
     *
     * @access package
     * @static
     * @param int $code Error Code
     * @param string $message Error Message
     */
    public static function errorHandler($code, $message)
    {
        // Cuts off the 'preg_match(): ' part of the error message.
        $error = substr($message, 14);
        // Sets the error flag with the message.
        RegEx::error($error);
    }

    /**
     * Holds the error from the last validation check.
     *
     * The first parameter is used internally and should
     * not be used by the developer.
     *
     * @access public
     * @static
     * @param FALSE|string $value value to set for $flag
     * @return FALSE|string
     */
    public static function error($value = NULL)
    {
        static $flag = FALSE;
        if (!is_null($value)) {
            $flag = $value;
        }
        return $flag;
    }
}
?>
16-Dec-2007 04:33
I am quite astonished to find that a dot (.) does NOT match umlauts and other 'special' characters, despite what is written in this documentation.
I have not yet worked out which other characters are affected, but it seems best that everybody is aware of this behaviour, which is incompatible with Perl.
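One possible explanation, assuming the subject string was UTF-8 encoded (the note does not say): without the `u` modifier the dot matches a single byte, while an umlaut takes two bytes in UTF-8. A minimal sketch of that case:

```php
<?php
// "ä" encoded as UTF-8 is the two bytes 0xC3 0xA4.
$umlaut = "\xC3\xA4";

// Without the "u" modifier the dot matches one byte, so a single dot
// cannot cover the whole two-byte character.
$byteMode = preg_match('/^.$/', $umlaut);

// With the "u" modifier, pattern and subject are treated as UTF-8 and
// the dot matches the complete character.
$utf8Mode = preg_match('/^.$/u', $umlaut);

var_dump($byteMode, $utf8Mode);
```

If the text is in a single-byte encoding instead, this explanation does not apply and the cause lies elsewhere.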
13-Sep-2007 01:42
One comment about 5.2.x and pcre.backtrack_limit:
Note that this setting wasn't present in previous PHP releases, and the behaviour (or limit) in those releases was, in practice, higher, so all these PCRE functions were able to "capture" longer strings.
With the arrival of the setting, defaulting to 100000 (less than 100K), you won't be able to match/capture strings over that size using, for example, "ungreedy" modifiers.
So, in a lot of situations, you'll need to raise that (very small IMO) limit.
The worst part is that PHP simply won't match/capture strings over pcre.backtrack_limit and will be 100% silent about it (I think that raising a NOTICE/WARNING when the limit is hit would help developers a lot).
There are a lot of people suffering from this changed behaviour, from what I've read on forums, bug reports and so on.
Hope this note helps, ciao :-)
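Although no warning is raised, the failure can be detected with preg_last_error() and the PREG_BACKTRACK_LIMIT_ERROR constant listed above. The sketch below makes two assumptions of its own: it lowers the limit to a tiny value so the problem is easy to reproduce, and it switches off the pcre.jit setting where that setting exists.

```php
<?php
// Assumption: a deliberately tiny limit, only to reproduce the failure.
ini_set('pcre.backtrack_limit', '100');
// Assumption: where the pcre.jit setting exists, disable it so the
// backtrack limit is what stops the match; elsewhere this does nothing.
ini_set('pcre.jit', '0');

// A pattern that backtracks heavily on a subject it cannot match.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

if ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit was exhausted\n";
}
```

In real code you would more likely raise the limit than lower it, and check preg_last_error() after any match where a silent failure would matter.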
05-May-2007 04:16
PCRE faster than POSIX RE? Not always.
In a recent search-engine project here at Cynergi, I had a simple loop with a few cute ereg_replace() calls that took 3min to process data. I replaced that 10-line loop with 100 lines of hand-written replacement code, and the loop then took 10s to process the same data! This opened my eyes to how slow regular expressions can be *IN SOME CASES*.
Lately I decided to look into Perl-compatible regular expressions (PCRE). Most pages claim PCRE is faster than POSIX, but a few claim otherwise. I decided to run benchmarks of my own.
My first few tests confirmed PCRE to be faster, but... the results were slightly different from what others were getting, so I decided to benchmark every case of RE usage I had in an 8000-line secure (and fast) Webmail project here at Cynergi to check it out.
The results? Inconclusive! Sometimes PCRE *are* faster (sometimes by a factor greater than 100x!), but at other times POSIX RE are faster (by a factor of 2x).
I have yet to find a rule for when one or the other is faster. It's not only about search data size, amount of data matched, or "RE compilation time" (which would show when you repeat the function often); if it were, one would *always* be faster than the other. I didn't find a pattern here. But, truth be told, I also didn't take the time to look into the source code and analyse the problem.
I can give you some examples, though. The POSIX RE
([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+
([0-9]{2}):([0-9]{2}):([0-9]{2})
is 30% faster in POSIX than when converted to PCRE (even if you use \d and \D and non-greedy matching). On the other hand, a similarly complex PCRE pattern
/[0-9]{1,2}[ \t]+[a-zA-Z]{3}[ \t]+[0-9]{4}[ \t]+[0-9]{1,2}:[0-9]{1,2}(:[0-9]{1,2})?[ \t]+[+-][0-9]{4}/
is 2.5x faster in PCRE than in POSIX RE. Simple replacement patterns like
ereg_replace( "[^a-zA-Z0-9-]+", "", $m );
are 2x faster in POSIX RE than PCRE. And then we get confused again because a POSIX RE pattern like
(^|\n|\r)begin-base64[ \t]+[0-7]{3,4}[ \t]+......
is 2x faster as POSIX RE, but the case-insensitive PCRE
/^Received[ \t]*:[ \t]*by[ \t]+([^ \t]+)[ \t]/i
is 30x faster than its POSIX RE version!
When it comes to case sensitivity, PCRE has so far seemed to be the best option. But I found some really strange behaviour from ereg/eregi. On a very simple POSIX RE
(^|\r|\n)mime-version[ \t]*:
I found eregi() taking 3.60s (just a number in a test benchmark), while the corresponding PCRE took 0.16s! But if I used ereg() (case-sensitive) the POSIX RE time went down to 0.08s! So I investigated further. I tried to make the POSIX RE case-insensitive itself. I got as far as this:
(^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][ \t]*:
This version also took 0.08s. But if I try to apply the same rule to any of the 'v', 'e', 'r' or 's' letters that are not changed, the time is back to the 3.60s mark, and not gradually, but immediately so! The test data didn't have any "vers" in it, other "mime" words in it or any "ion" that might be confusing the POSIX parser, so I'm at a loss.
Bottom line: always benchmark your PCRE / POSIX RE to find the fastest!
Tests were performed with PHP 5.1.2 under Windows, from the command line.
Pedro Freire
cynergi.com
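A minimal timing harness along the lines of this advice might look like the sketch below. `timeIt()` is a hypothetical helper, the subject string is made up, and the pattern assumes that the two-line POSIX expression quoted above is one expression wrapped across lines. Only PCRE variants are compared here because the ereg functions no longer exist in current PHP versions.

```php
<?php
/**
 * Hypothetical helper: seconds taken to call $fn $iterations times.
 */
function timeIt($fn, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($fn);
    }
    return microtime(true) - $start;
}

// Made-up test data; substitute data from your own application.
$subject = 'Logged at 2007/05/05 -- 04:16:00 by the mail gateway';

// Two spellings of the same date/time pattern.
$candidates = array(
    'explicit classes' => '#([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+([0-9]{2}):([0-9]{2}):([0-9]{2})#',
    'shorthand \d \D'  => '#(\d{4})/(\d{2})/(\d{2})\D+(\d{2}):(\d{2}):(\d{2})#',
);

foreach ($candidates as $label => $pattern) {
    $seconds = timeIt(function () use ($pattern, $subject) {
        preg_match($pattern, $subject);
    }, 100000);
    printf("%-18s %.4f s\n", $label, $seconds);
}
```

Timings from a loop like this vary between runs and machines, so repeat the measurement a few times with realistic data before drawing conclusions.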
19-Feb-2006 02:19
I read this part, but I couldn't understand a single word, because first I need to know basic regular expressions. Somebody posted a link for Perl, which is almost like PHP, but here is one totally dedicated to PHP:
http://weblogtoolscollection.com/regex/regex.php
22-Sep-2005 11:50
There's a printable PDF PCRE cheat sheet available here:
http://www.phpguru.org/article.php?ne_id=67
It covers the common metacharacters, quantifiers, pattern modifiers, character classes and assertions, with short explanations.
23-Oct-2004 06:08
If you want to perform regular expressions on Unicode strings, the PCRE functions will NOT be of any help. You need to use the Multibyte extension: mb_ereg(), mb_eregi(), mb_ereg_replace() and so on. When doing so, be careful to set the default text encoding to the same encoding used by the text you are searching and replacing in. You can do that with the mb_regex_encoding() function. You will probably also want to set the default encoding for the other mb_* string functions with mb_internal_encoding().
So when dealing with, say, French text, I start with these:
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
setlocale(LC_ALL, 'fr-fr');
?>
20-Jul-2004 05:17
Something to bear in mind is that regex is actually a declarative programming language like Prolog: your regex is a set of rules which the regex interpreter tries to match against a string. During this matching, the interpreter will assume certain things, and continue assuming them until it comes up against a failure to match, which then causes it to backtrack. Regex assumes "greedy matching" unless explicitly told not to, which can cause a lot of backtracking. A general rule of thumb is that the more backtracking, the slower the matching process.
It is therefore vital, if you are trying to optimise your program to run quickly (and if you can't do without regex), to optimise your regexes to match quickly.
I recommend the use of a tool such as "The Regex Coach" to debug your regex strings.
http://weitz.de/files/regex-coach.exe (Windows installer) http://weitz.de/files/regex-coach.tgz (Linux tar archive)
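To illustrate the point about greedy matching and backtracking, here is a small sketch with made-up input. It shows how the same tag-matching idea behaves when written greedily, lazily, and with a negated character class.

```php
<?php
// Made-up input for illustration.
$html = '<b>bold</b> and <i>italic</i>';

// Greedy: ".+" first grabs as much as it can, then backtracks until a
// ">" fits, so the match ends at the LAST ">" in the string.
preg_match('/<.+>/', $html, $greedy);

// Lazy: ".+?" grows one character at a time and stops at the FIRST ">".
preg_match('/<.+?>/', $html, $lazy);

// Negated class: "[^>]+" can never run past a ">", which leaves the
// engine very little to backtrack over.
preg_match('/<[^>]+>/', $html, $negated);

var_dump($greedy[0], $lazy[0], $negated[0]);
```

The lazy and negated-class versions find the same text here; the negated class is often the cheaper of the two because it states exactly what may appear between the brackets.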
20-Sep-2003 09:00
Regular expression tutorials from non-PHP sites:
http://www.amk.ca/python/howto/regex/
http://sitescooper.org/tao_regexps.html
http://www.english.uga.edu/humcomp/perl/regex2a.html
http://www.english.uga.edu/humcomp/perl/regexps.html
http://www.english.uga.edu/humcomp/perl/regular_expressions.HTML
http://www.english.uga.edu/humcomp/perl/
http://java.sun.com/docs/books/tutorial/extra/regex/
http://gnosis.cx/publish/programming/regular_expressions.html
http://www.zvon.org/other/PerlTutorial/Books/Book1/
http://it.metr.ou.edu/regex/
http://www.regular-expressions.info/
