Recursive patterns
Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not
possible to handle an arbitrary nesting depth. Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (among other things). The special
item (?R) is provided for the specific case of recursion.
This PCRE pattern solves the parentheses problem (assume
the PCRE_EXTENDED
option is set so that white space is
ignored):
\( ( (?>[^()]+) | (?R) )* \)
First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring). Finally there is
a closing parenthesis.
This particular example pattern contains nested unlimited
repeats, and so the use of a once-only subpattern for matching
strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a once-only subpattern
is not used, the match runs for a very long time
indeed because there are so many different ways the + and *
repeats can carve up the subject, and all have to be tested
before failure can be reported.
The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against
(ab(cd)ef)
the value for the capturing parentheses is "ef", which is
the last value taken on at the top level. If additional
parentheses are added, giving
\( ( ( (?>[^()]+) | (?R) )* ) \)
then the string they capture
is "ab(cd)ef", the contents of the top level parentheses. If
there are more than 15 capturing parentheses in a pattern,
PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it
via pcre_free afterwards. If no memory can be obtained, it
saves data for the first 15 capturing parentheses only, as
there is no way to give an out-of-memory error from within a
recursion.
(?1)
, (?2)
and so on
can be used for recursive subpatterns too. It is also possible to use named
subpatterns: (?P>name)
or
(?&name)
.
If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. An earlier example
pointed out that the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
is used, it does match "sense and responsibility" as well as the other
two strings. Such references must, however, follow the subpattern to
which they refer.
The maximum length of a subject string is the largest positive number
that an integer variable can hold. However, PCRE uses recursion to
handle subpatterns and indefinite repetition. This means that the
available stack space may limit the size of a subject string that can be
processed by certain patterns.