Recently I had to write search engine in hebrew and ran into huge amount of problems. My data was stored in MySQL table with utf8_bin encoding.
So, to be able to write hebrew in utf8 table you need to do
<?php
$prepared_text = addslashes(urf8_encode($text));
?>
But then I had to find if some word exists in stored text. This is the place I got stuck. Simple preg_match would not find text since hebrew doesnt work that easy. I've tried with /u and who kows what else.
Solution was somewhat logical and simple...
<?php
$db_text = bin2hex(stripslashes(utf8_decode($db_text)));
$word = bin2hex($word);
$found = preg_match_all("/($word)+/i", $db_text, $matches);
?>
I've used preg_match_all since it returns number of occurences. So I could sort search results acording to that.
Hope someone finds this useful!
preg_match_all
(PHP 4, PHP 5)
preg_match_all — Perform a global regular expression match
Description
Searches subject for all matches to the regular expression given in pattern and puts them in matches in the order specified by flags .
After the first match is found, the subsequent searches are continued on from end of the last match.
Parameters
- pattern
-
The pattern to search for, as a string.
- subject
-
The input string.
- matches
-
Array of all matches in multi-dimensional array ordered according to flags .
- flags
-
Can be a combination of the following flags (note that it doesn't make sense to use PREG_PATTERN_ORDER together with PREG_SET_ORDER):
- PREG_PATTERN_ORDER
-
Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.
<?php
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
"<b>example: </b><div align=left>this is a test</div>",
$out, PREG_PATTERN_ORDER);
echo $out[0][0] . ", " . $out[0][1] . "\n";
echo $out[1][0] . ", " . $out[1][1] . "\n";
?>The above example will output:
<b>example: </b>, <div align=left>this is a test</div> example: , this is a test
So, $out[0] contains array of strings that matched full pattern, and $out[1] contains array of strings enclosed by tags.
- PREG_SET_ORDER
-
Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on.
<?php
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
"<b>example: </b><div align=\"left\">this is a test</div>",
$out, PREG_SET_ORDER);
echo $out[0][0] . ", " . $out[0][1] . "\n";
echo $out[1][0] . ", " . $out[1][1] . "\n";
?>The above example will output:
<b>example: </b>, example: <div align="left">this is a test</div>, this is a test
- PREG_OFFSET_CAPTURE
-
If this flag is passed, for every occurring match the appendant string offset will also be returned. Note that this changes the value of matches in an array where every element is an array consisting of the matched string at offset 0 and its string offset into subject at offset 1.
If no order flag is given, PREG_PATTERN_ORDER is assumed.
- offset
-
Normally, the search starts from the beginning of the subject string. The optional parameter offset can be used to specify the alternate place from which to start the search (in bytes).
Note: Using offset is not equivalent to passing substr($subject, $offset) to preg_match_all() in place of the subject string, because pattern can contain assertions such as ^, $ or (?<=x). See preg_match() for examples.
Return Values
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
ChangeLog
| Version | Description |
|---|---|
| 4.3.3 | The offset parameter was added |
| 4.3.0 | The PREG_OFFSET_CAPTURE flag was added |
Examples
Example #1 Getting all phone numbers out of some text.
<?php
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
"Call 555-1212 or 1-800-555-1212", $phones);
?>
Example #2 Find matching HTML tags (greedy)
<?php
// The \\2 is an example of backreferencing. This tells pcre that
// it must match the second set of parentheses in the regular expression
// itself, which would be the ([\w]+) in this case. The extra backslash is
// required because the string is in double quotes.
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
preg_match_all("/(<([\w]+)[^>]*>)(.*)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
echo "matched: " . $val[0] . "\n";
echo "part 1: " . $val[1] . "\n";
echo "part 2: " . $val[3] . "\n";
echo "part 3: " . $val[4] . "\n\n";
}
?>
The above example will output:
matched: <b>bold text</b> part 1: <b> part 2: bold text part 3: </b> matched: <a href=howdy.html>click me</a> part 1: <a href=howdy.html> part 2: click me part 3: </a>
Example #3 Using named subpattern
<?php
$str = <<<FOO
a: 1
b: 2
c: 3
FOO;
preg_match_all('/(?<name>\w+): (?<digit>\d+)/', $str, $matches);
print_r($matches);
?>
The above example will output:
Array ( [0] => Array ( [0] => a: 1 [1] => b: 2 [2] => c: 3 ) [name] => Array ( [0] => a [1] => b [2] => c ) [1] => Array ( [0] => a [1] => b [2] => c ) [digit] => Array ( [0] => 1 [1] => 2 [2] => 3 ) [2] => Array ( [0] => 1 [1] => 2 [2] => 3 ) )
preg_match_all
15-Oct-2008 02:56
07-Oct-2008 01:25
Here is a way to match everything on the page, performing an action for each match as you go. I had used this idiom in other languages, where its use is customary, but in PHP it seems to be not quite as common.
<?php
function custom_preg_match_all($pattern, $subject)
{
$offset = 0;
$match_count = 0;
while(preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $offset))
{
// Increment counter
$match_count++;
// Get byte offset and byte length (assuming single byte encoded)
$match_start = $matches[0][1];
$match_length = strlen(matches[0][0]);
// (Optional) Transform $matches to the format it is usually set as (without PREG_OFFSET_CAPTURE set)
foreach($matches as $k => $match) $newmatches[$k] = $match[0];
$matches = $new_matches;
// Your code here
echo "Match number $match_count, at byte offset $match_start, $match_length bytes long: ".$matches[0]."\r\n";
// Update offset to the end of the match
$offset = $match_start + $match_length;
}
return $match_count;
}
?>
Note that the offsets returned are byte values (not necessarily number of characters) so you'll have to make sure the data is single-byte encoded. (Or have a look at paolo mosna's strByte function on the strlen manual page).
I'd be interested to know how this method performs speedwise against using preg_match_all and then recursing through the results.
19-Jun-2008 01:46
Perhaps you want to find the positions of all anchor tags. This will return a two dimensional array of which the starting and ending positions will be returned.
<?php
function getTagPositions($strBody)
{
define(DEBUG, false);
define(DEBUG_FILE_PREFIX, "/tmp/findlinks_");
preg_match_all("/<[^>]+>(.*)<\/[^>]+>/U", $strBody, $strTag, PREG_PATTERN_ORDER);
$intOffset = 0;
$intIndex = 0;
$intTagPositions = array();
foreach($strTag[0] as $strFullTag) {
if(DEBUG == true) {
$fhDebug = fopen(DEBUG_FILE_PREFIX.time(), "a");
fwrite($fhDebug, $fulltag."\n");
fwrite($fhDebug, "Starting position: ".strpos($strBody, $strFullTag, $intOffset)."\n");
fwrite($fhDebug, "Ending position: ".(strpos($strBody, $strFullTag, $intOffset) + strlen($strFullTag))."\n");
fwrite($fhDebug, "Length: ".strlen($strFullTag)."\n\n");
fclose($fhDebug);
}
$intTagPositions[$intIndex] = array('start' => (strpos($strBody, $strFullTag, $intOffset)), 'end' => (strpos($strBody, $strFullTag, $intOffset) + strlen($strFullTag)));
$intOffset += strlen($strFullTag);
$intIndex++;
}
return $intTagPositions;
}
$strBody = 'I have lots of <a href="http://my.site.com">links</a> on this <a href="http://my.site.com">page</a> that I want to <a href="http://my.site.com">find</a> the positions.';
$strBody = strip_tags(html_entity_decode($strBody), '<a>');
$intTagPositions = getTagPositions($strBody);
print_r($intTagPositions);
/*****
Output:
Array (
[0] => Array (
[start] => 15
[end] => 53 )
[1] => Array (
[start] => 62
[end] => 99 )
[2] => Array (
[start] => 115
[end] => 152 )
)
*****/
?>
04-Mar-2008 12:13
To count str_length in UTF-8 string i use
$count = preg_match_all("/[[:print:]\pL]/u", $str, $pockets);
where
[:print:] - printing characters, including space
\pL - UTF-8 Letter
/u - UTF-8 string
other unicode character properties on http://www.pcre.org/pcre.txt
28-Jan-2008 04:30
please note, that the function of "mail at SPAMBUSTER at milianw dot de" can result in invalid xhtml in some cases. think i used it in the right way but my result is sth like this:
<img src="./img.jpg" alt="nice picture" />foo foo foo foo </img>
correct me if i'm wrong.
i'll see when there's time to fix that. -.-
12-Jul-2007 02:57
<?php
// Returns an array of strings where the start and end are found
function findinside($start, $end, $string) {
preg_match_all('/' . preg_quote($start, '/') . '([^\.)]+)'. preg_quote($end, '/').'/i', $string, $m);
return $m[1];
}
$start = "mary has";
$end = "lambs.";
$string = "mary has 6 lambs. phil has 13 lambs. mary stole phil's lambs. now mary has all the lambs.";
$out = findinside($start, $end, $string);
print_r ($out);
/* Results in
(
[0] => 6
[1] => all the
)
*/
?>
26-Jun-2007 11:22
If you'd like to include DOUBLE QUOTES on a regular expression for use with preg_match_all, try ESCAPING THRICE, as in: \\\"
For example, the pattern:
'/<table>[\s\w\/<>=\\\"]*<\/table>/'
Should be able to match:
<table>
<row>
<col align="left" valign="top">a</col>
<col align="right" valign="bottom">b</col>
</row>
</table>
.. with all there is under those table tags.
I'm not really sure why this is so, but I tried just the double quote and one or even two escape characters and it won't work. In my frustration I added another one and then it's cool.
06-Dec-2006 06:20
This is a function to convert byte offsets into (UTF-8) character offsets (this is reagardless of whether you use /u modifier:
<?php
function mb_preg_match_all($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = PREG_PATTERN_ORDER, $pn_offset = 0, $ps_encoding = NULL) {
// WARNING! - All this function does is to correct offsets, nothing else:
//
if (is_null($ps_encoding))
$ps_encoding = mb_internal_encoding();
$pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
$ret = preg_match_all($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
foreach($pa_matches as &$ha_match)
foreach($ha_match as &$ha_match)
$ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
//
// (code is independent of PREG_PATTER_ORDER / PREG_SET_ORDER)
return $ret;
}
?>
20-Feb-2006 12:53
Here's some fleecy code to 1. validate RCF2822 conformity of address lists and 2. to extract the address specification (the part commonly known as 'email'). I wouldn't suggest using it for input form email checking, but it might be just what you want for other email applications. I know it can be optimized further, but that part I'll leave up to you nutcrackers. The total length of the resulting Regex is about 30000 bytes. That because it accepts comments. You can remove that by setting $cfws to $fws and it shrinks to about 6000 bytes. Conformity checking is absolutely and strictly referring to RFC2822. Have fun and email me if you have any enhancements!
<?php
function mime_extract_rfc2822_address($string)
{
//rfc2822 token setup
$crlf = "(?:\r\n)";
$wsp = "[\t ]";
$text = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";
$quoted_pair = "(?:\\\\$text)";
$fws = "(?:(?:$wsp*$crlf)?$wsp+)";
$ctext = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F" .
"!-'*-[\\]-\\x7F]";
$comment = "(\\((?:$fws?(?:$ctext|$quoted_pair|(?1)))*" .
"$fws?\\))";
$cfws = "(?:(?:$fws?$comment)*(?:(?:$fws?$comment)|$fws))";
//$cfws = $fws; //an alternative to comments
$atext = "[!#-'*+\\-\\/0-9=?A-Z\\^-~]";
$atom = "(?:$cfws?$atext+$cfws?)";
$dot_atom_text = "(?:$atext+(?:\\.$atext+)*)";
$dot_atom = "(?:$cfws?$dot_atom_text$cfws?)";
$qtext = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!#-[\\]-\\x7F]";
$qcontent = "(?:$qtext|$quoted_pair)";
$quoted_string = "(?:$cfws?\"(?:$fws?$qcontent)*$fws?\"$cfws?)";
$dtext = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!-Z\\^-\\x7F]";
$dcontent = "(?:$dtext|$quoted_pair)";
$domain_literal = "(?:$cfws?\\[(?:$fws?$dcontent)*$fws?]$cfws?)";
$domain = "(?:$dot_atom|$domain_literal)";
$local_part = "(?:$dot_atom|$quoted_string)";
$addr_spec = "($local_part@$domain)";
$display_name = "(?:(?:$atom|$quoted_string)+)";
$angle_addr = "(?:$cfws?<$addr_spec>$cfws?)";
$name_addr = "(?:$display_name?$angle_addr)";
$mailbox = "(?:$name_addr|$addr_spec)";
$mailbox_list = "(?:(?:(?:(?<=:)|,)$mailbox)+)";
$group = "(?:$display_name:(?:$mailbox_list|$cfws)?;$cfws?)";
$address = "(?:$mailbox|$group)";
$address_list = "(?:(?:^|,)$address)+";
//output length of string (just so you see how f**king long it is)
echo(strlen($address_list) . " ");
//apply expression
preg_match_all("/^$address_list$/", $string, $array, PREG_SET_ORDER);
return $array;
};
?>
02-Feb-2006 10:05
PREG_OFFSET_CAPTURE always seems to provide byte offsets, rather than character position offsets, even when you are using the unicode /u modifier.
