PHP/String/Regular Expressions


^ and $ are line anchors.

 
^     specifies the beginning of the line.
$     specifies the end of the line.
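
A minimal sketch of the two anchors with preg_match(). The sample text is illustrative, and the m (multiline) modifier is an addition not covered above: without it, ^ and $ anchor to the whole subject string; with it, they also match around each newline.

```php
<?php
// Illustrative two-line subject.
$text = "first line\nsecond line";

// Without modifiers, ^ matches only at the very start of the subject.
$startsWithFirst  = preg_match('/^first/', $text);    // 1
$startsWithSecond = preg_match('/^second/', $text);   // 0

// With the m modifier, ^ and $ also match next to each newline.
$lineStartsWithSecond = preg_match('/^second/m', $text);      // 1
$lineEndsWithFirst    = preg_match('/first line$/m', $text);  // 1

var_dump($startsWithFirst, $startsWithSecond, $lineStartsWithSecond, $lineEndsWithFirst);
```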



\b and \B mean "on a word boundary" and "not on a word boundary," respectively.

 
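<?php
    $string = "this is a test!";
    // "oo" does not occur in $string, so neither of the next two patterns matches.
    if (preg_match('/oo\b/i', $string)) {
        print "Found \"oo\" at the end of a word\n";
    }
    preg_match('/oo\B/i', $string);
    // \b matches between "o" and "!", so this returns 1.
    preg_match('/no\b/', 'he said "no!"');
    // \b matches between "y" and "-", so this returns 1.
    preg_match('/royalty\b/', 'royalty-free photograph');
?>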



Brackets [] match any one character from a set or range.

 
Regexp [php] finds any string containing the character p or h. 
[0-9] matches any decimal digit from 0 through 9.
[a-z] matches any character from lowercase a through lowercase z.
[A-Z] matches any character from uppercase A through uppercase Z.
[A-Za-z] matches any letter, uppercase or lowercase. (A range such as [a-Z] is invalid, because a sorts after Z in ASCII.)
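
A short sketch of these bracket expressions in use; the subject strings are made up for illustration.

```php
<?php
// [0-9] finds the first decimal digit in the subject.
preg_match('/[0-9]/', 'Room 42', $digit);
// [A-Z] finds the first uppercase letter.
preg_match('/[A-Z]/', 'hello World', $upper);
// [php] matches a single p or h, so "hi" matches and "dog" does not.
$hasPOrH = preg_match('/[php]/', 'hi');   // 1
$noPOrH  = preg_match('/[php]/', 'dog');  // 0

print $digit[0] . " " . $upper[0] . " $hasPOrH $noPOrH\n";  // prints "4 W 1 0"
```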



Character Classes

 
[ indicates the beginning of a character class. 
- indicates a range inside a character class (unless it is first in the class). 
^ indicates a negated character class (if found first). 
] indicates the end of a character class.
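
The special characters above can be sketched as follows, assuming an illustrative phone-number string: a leading ^ negates the class, and a hyphen placed first is taken literally.

```php
<?php
// ^ first in the class negates it: remove everything that is not a digit.
$digitsOnly = preg_replace('/[^0-9]/', '', 'Tel: 555-0199');

// A hyphen placed first in the class is a literal hyphen, not a range.
$hasHyphenOrPlus = preg_match('/[-+]/', 'a-b');  // 1

print "$digitsOnly $hasHyphenOrPlus\n";  // prints "5550199 1"
```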



Complete list of regular expression examples

 
Expression                    Will match . . .
 
foo                           The string "foo"
 
^foo                          "foo" at the start of a line
 
foo$                          "foo" at the end of a line
 
^foo$                         "foo" when it is alone on a line
 
[Ff]oo                        "Foo" or "foo"
 
[abc]                         a, b, or c
 
[^abc]                        d, e, f, g, V, %, ~, 5, etc.; everything that is not a, b, or c (^ means "not" inside a character class)
 
[A-Z]                         Any uppercase letter
 
[a-z]                         Any lowercase letter
 
[A-Za-z]                      Any letter
 
[A-Za-z0-9]                   Any letter or number
 
[A-Z]+                        One or more uppercase letters
 
[A-Z]*                        Zero or more uppercase letters
 
[A-Z]?                        Zero or one uppercase letter
 
[A-Z]{3}                      Three uppercase letters
 
[A-Z]{3,}                     A minimum of three uppercase letters
 
[A-Z]{1,3}                    One, two, or three uppercase letters
 
[^0-9]                        Any non-numeric character
 
[^0-9A-Za-z]                  Any symbol (not a number or a letter)
 
(cat|sat)                     Matches either "cat" or "sat"
 
([A-Z]{3}|[0-9]{4})           Matches three uppercase letters or four digits
 
Fo*                           F, Fo, Foo, Fooo, Foooo, etc.
 
Fo+                           Fo, Foo, Fooo, Foooo, etc.
 
Fo?                           F, Fo
 
.                             Any character except \n (new line)
 
\b                            A word boundary; e.g. te\b matches the "te" in "late" but not the "te" in "tell."
 
\B                            A non-word boundary; "te\B" matches the "te" in "tell" but not the "te" in "late."
 
\n                            Newline character
 
\s                            Any whitespace (new line, space, tab, etc.)
 
\S                            Any non-whitespace character
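
A few rows of the table can be checked directly with preg_match(). This is only a sketch: the subjects and the expected values in the array are chosen for illustration.

```php
<?php
// Each entry: pattern from the table, sample subject, expected preg_match() result.
$cases = array(
    array('/^foo$/',               'foo',     1),
    array('/^foo$/',               'foo bar', 0),
    array('/[A-Z]{3,}/',           'ABcd',    0),  // only two uppercase letters in a row
    array('/([A-Z]{3}|[0-9]{4})/', 'id 2024', 1),  // four digits
    array('/te\b/',                'late',    1),  // "te" ends the word
    array('/te\B/',                'late',    0),
);
$results = array();
foreach ($cases as $case) {
    list($pattern, $subject, $expected) = $case;
    $actual = preg_match($pattern, $subject);
    $results[] = $actual;
    printf("%-24s %-8s %d %s\n", $pattern, $subject, $actual,
        $actual === $expected ? 'as expected' : 'UNEXPECTED');
}
```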



Define a pattern and use parentheses to match individual elements within it

 
<?php
$test = "Whatever you do, don't panic!";
// $array[0] holds the whole match, $array[1] and $array[2] the two groups.
if ( preg_match( "/(don't)\s+(panic)/", $test, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}
?>



Greedy and non-greedy matching

 
<?php
$meats = "<b>Chicken</b>, <b>Beef</b>, <b>Duck</b>";
preg_match_all("@<b>.*?</b>@",$meats,$matches);
foreach ($matches[0] as $meat) {
    print "Meat A: $meat\n";
}
preg_match_all("@<b>.*</b>@",$meats,$matches);
foreach ($matches[0] as $meat) {
    print "Meat B: $meat\n";
}
?>



Greedy Qualifiers

 
Qualifier What It Matches 
* The preceding expression can be found any number of times, including zero. 
+ The preceding expression can be found one or more times. 
? The preceding expression can be found at most once.
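
A small sketch of how these three behave on the same subject; the strings are illustrative. Note that * still succeeds when the expression is absent, because zero occurrences count as a match.

```php
<?php
$subject = 'aaa';
preg_match('/a*/', $subject, $star);  // greedy: takes all three
preg_match('/a+/', $subject, $plus);  // greedy: takes all three
preg_match('/a?/', $subject, $opt);   // at most one
print "{$star[0]} {$plus[0]} {$opt[0]}\n";  // prints "aaa aaa a"

// With no "b" in the subject, b* matches the empty string but b+ fails.
$starOnMissing = preg_match('/b*/', $subject);  // 1
$plusOnMissing = preg_match('/b+/', $subject);  // 0
print "$starOnMissing $plusOnMissing\n";        // prints "1 0"
```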



Greedy versus nongreedy matching

 
<?php
$html = "<em>love</em> you <em>.</em>";
// Greedy
$matchCount = preg_match_all("@<em>.+</em>@", $html, $matches);
print "Greedy count: " . $matchCount . "\n";
// Nongreedy
$matchCount = preg_match_all("@<em>.+?</em>@", $html, $matches);
print "First non-greedy count: " . $matchCount . "\n";
// Nongreedy
$matchCount = preg_match_all("@<em>.+</em>@U", $html, $matches);
print "Second non-greedy count: " . $matchCount . "\n";
?>



Grouping captured subpatterns

 
<?php
$todo = "
first=a
next=B
last=C
";
preg_match_all("/([a-zA-Z]+)=(.*)/", $todo, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    print "The {$match[1]} action is {$match[2]} \n";
}
?>



Line Anchors

 
^ specifies the beginning of the line. 
$ specifies the end of the line.



Match an IP address

 
<?php
$test = "156.152.55.35";
if ( preg_match( "/(\d+)\.(\d+)\.(\d+)\.(\d+)/", $test, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}
?>



Matching a Valid E-mail Address

 
<?php 
// The hyphen is escaped so that it is a literal character rather than a range from + to /.
$regex = "/^[\w\d!#$%&'*+\-\/=?^`{|}~]+(\.[\w\d!#$%&'*+\-\/=?^`{|}~]+)*@([a-z\d][-a-z\d]*[a-z\d]\.)+[a-z][-a-z\d]*[a-z]$/"; 
$values = array( 
"user@example.ru",
"user@example"
); 
foreach ($values as $value) { 
    if (preg_match($regex, $value)) { 
        printf("Found valid address: %s\n", $value); 
    } else { 
        printf("INVALID address: %s\n", $value); 
    } 
} 
?>



Matching a Valid IP Address

 
<?php 
$good_ip = "192.168.0.1"; 
$bad_ip = "1.334.10.10"; 
// Each octet must be 0-255; the dots are escaped so that they match literally.
$regex = '/^(([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$/';
// preg_match() is used because ereg() was removed in PHP 7.
if (preg_match($regex, $good_ip)) { 
    echo "\"" . $good_ip . "\" is a valid ip address.\n"; 
} else { 
    echo "\"" . $good_ip . "\" is an INVALID ip address.\n"; 
} 
if (preg_match($regex, $bad_ip)) { 
    echo "\"" . $bad_ip . "\" is a valid ip address.\n"; 
} else {
    echo "\"" . $bad_ip . "\" is an INVALID ip address.\n"; 
} 
?>



Matching GUIDs/UUIDs

 
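<?php 
$uuid = "1111-1111-1111-1111"; 
function printResults($str) { 
    // preg_match() with the i modifier is used because eregi() was removed in PHP 7.
    if (preg_match('/^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$/i', $str)) { 
        printf("\"%s\" is a valid GUID/UUID.\n", $str); 
    } else { 
        printf("\"%s\" is NOT a valid GUID/UUID.\n", $str); 
    } 
} 
// The sample value is too short, so this prints the "NOT a valid" message.
printResults($uuid); 
?>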



Matching using backreferences

 
<?php
$ok_html  = "I <b>love</b> shrimp dumplings.";
if (preg_match("@<[bi]>.*?</[bi]>@",$ok_html)) {
    print "Good for you! (OK, No backreferences)\n";
}
if (preg_match("@<([bi])>.*?</\\1>@",$ok_html)) {
    print "Good for you! (OK, Backreferences)\n";
}
?>



Matching with |

 
<?php
$text = "The files are c.gif, r.pdf, and e.jpg.";
if (preg_match_all("/[a-zA-Z0-9]+\.(gif|jpe?g)/",$text,$matches)) {
    print "The image files are: " . implode(",",$matches[0]);
}
?>



Matching with character classes and anchors

 
<?php
$thisFileContents = file_get_contents(__FILE__);
// Single quotes keep \$ intact, so the regex engine sees an escaped (literal) dollar sign.
$matchCount = preg_match_all('/\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/', $thisFileContents, $matches);
print "Matches: $matchCount\n";
foreach ($matches[0] as $variableName) {
    print "$variableName\n";
}
?>



Matching with Greedy vs. Nongreedy Expressions

 
<?php 
$teststring = '"Hello" and "Goodbye."'; 
$greedyresult = preg_replace('/".*"/', '"***"', $teststring); 
$nongreedyresult = preg_replace('/".*?"/', '"***"', $teststring); 
echo "Original: $teststring\n"; 
echo "Greedy Replace: $greedyresult\n"; 
echo "Nongreedy Replace: $nongreedyresult\n"; 
?>



Match the smallest number of characters starting with "p" and ending with "t"

 
<?php
$text = "pot post pat patent";
if ( preg_match( "/p.*?t/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}
?>



Match URL

 
<?php
  $hostRegex = "([a-z\d][-a-z\d]*[a-z\d]\.)*[a-z][-a-z\d]*[a-z]";
  $portRegex = "(:\d{1,})?";
  $pathRegex = "(\/[^\s?]+)?";
  $queryRegex = "(\?[^<>#\"\s]+)?";
  $urlRegex = "/(?:(?<=^)|(?<=\s))((ht|f)tps?:\/\/" . $hostRegex . $portRegex . $pathRegex . $queryRegex . ")/";
  $str = "This is my homepage:  http://home.example.ru.";
  $str2 = "This is my homepage:  http://home.example.ru:8181/index.php";
  $sample1 = preg_replace($urlRegex, "<a href=\"\\1\">\\1</a>", $str);
  $sample2 = preg_replace($urlRegex, "<a href=\"\\1\">\\1</a>", $str2);
  echo $sample1 . "\n";
  echo $sample2 . "\n";
?>



Nongreedy Qualifiers

 
Qualifier           What It Matches
*?                  The preceding expression can be found any number of times, but the matching will stop as soon as it can.
+?                  The preceding expression can be found one or more times, but the matching will stop as soon as it can.
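
A brief sketch of the difference a trailing ? makes, using made-up subjects: the non-greedy forms stop at the shortest possible match.

```php
<?php
// *? may match nothing at all; +? needs one occurrence and then stops.
preg_match('/a*?/', 'aaa', $lazyStar);
preg_match('/a+?/', 'aaa', $lazyPlus);
print strlen($lazyStar[0]) . " " . $lazyPlus[0] . "\n";  // prints "0 a"

// The same idea on tag-like text: greedy runs to the last ">", lazy to the first.
preg_match('/<.+>/',  '<a><b>', $greedy);
preg_match('/<.+?>/', '<a><b>', $lazy);
print "{$greedy[0]} {$lazy[0]}\n";  // prints "<a><b> <a>"
```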



Option patterns:

 
(pattern) = Groups the pattern to act as one item and captures it
(x|y) = Matches either pattern x or pattern y
[abc] = Matches either the character a, b, or c
[^abc] = Matches any character except a, b, or c
[a-f] = Matches characters a through f
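
As a rough illustration, the pattern below combines a capturing group, an alternation, and a range; the subject string is invented for the example.

```php
<?php
// (cat|dog) captures whichever alternative matched; [a-f]+ is a run of letters a through f.
if (preg_match('/(cat|dog)-([a-f]+)/', 'my dog-cafe', $m)) {
    // $m[0] is the whole match, $m[1] and $m[2] are the two groups.
    print "{$m[0]} / {$m[1]} / {$m[2]}\n";  // prints "dog-cafe / dog / cafe"
}
```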



Pattern matches:

 
\d = Digit
\D = Not a digit
\s = Whitespace
\S = Not whitespace
. = Any character (except \n)
^ = Start of string
$ = End of string
\b = Word boundary
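
A quick sketch of several of these shorthands applied to one illustrative subject.

```php
<?php
$subject = "Order 66 shipped";

preg_match('/\d+/', $subject, $digits);               // first run of digits
$words      = preg_split('/\s+/', $subject);          // split on whitespace
$digitsOnly = preg_replace('/\D/', '', $subject);     // drop every non-digit
$wholeWord  = preg_match('/\bship\b/', $subject);     // 0: "ship" runs on into "shipped"
$dotNewline = preg_match('/./', "\n");                // 0: . does not match \n

print $digits[0] . " " . count($words) . " $digitsOnly $wholeWord $dotNewline\n";  // prints "66 3 66 0 0"
```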



Pattern match extenders:

 
? = Previous item is matched 0 or 1 times.
* = Previous item is matched 0 or more times.
+ = Previous item is matched 1 or more times.
{n} = Previous item is matched exactly n times.
{n,} = Previous item is matched at least n times.
{n,m} = Previous item is matched at least n and at most m times.
? (after any of the above) = Match as few times as possible.
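
The counted forms can be sketched like this, assuming a date-like string and a digit run as sample input.

```php
<?php
// {n}: exactly four, two, and two digits separated by hyphens.
$isDate = preg_match('/^\d{4}-\d{2}-\d{2}$/', '2024-01-15');  // 1

// {n,m} is greedy by default; a trailing ? makes it take as few as allowed.
preg_match('/\d{2,3}/',  '12345', $greedy);
preg_match('/\d{2,3}?/', '12345', $lazy);
// {n,} has no upper limit.
preg_match('/\d{2,}/',   '12345', $open);

print "$isDate {$greedy[0]} {$lazy[0]} {$open[0]}\n";  // prints "1 123 12 12345"
```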



Perl-Compatible Regular Expressions (PCRE)

 
\w represents a "word" character and is equivalent to the expression [A-Za-z0-9_].
\W represents the opposite of \w and is equivalent to [^A-Za-z0-9_].
\s represents a whitespace character.
\S represents a nonwhitespace character.
\d represents a digit and is equivalent to [0-9].
\D represents a nondigit character and is equivalent to [^0-9].
\n represents a newline character.
\r represents a carriage return character.
\t represents a tab character.
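
A small sketch of these escapes with illustrative strings. One detail worth seeing in practice is that in PCRE \w also matches the underscore.

```php
<?php
// The underscore counts as a word character, the hyphen does not.
$allWordChars = preg_match('/^\w+$/', 'user_name42');  // 1
$hasNonWord   = preg_match('/\W/', 'user-name');       // 1

// \r, \n and \t inside a pattern match carriage return, newline and tab.
$parts = preg_split('/[\r\n\t]+/', "a\tb\r\nc");

print "$allWordChars $hasNonWord " . implode(",", $parts) . "\n";  // prints "1 1 a,b,c"
```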



POSIX Regular Expressions Character Classes

 
Expression            Meaning
[[:alpha:]]           A letter, such as A-Z or a-z
[[:digit:]]           A digit, 0-9
[[:space:]]           Whitespace, such as a tab or space character
\< or \>              Word boundaries



Predefined Character Ranges (Character Classes)

 
[[:alpha:]] matches any string containing alphabetic characters aA through zZ.
[[:digit:]] matches any string containing numerical digits 0 through 9.
[[:alnum:]] matches any string containing alphanumeric characters aA through zZ and 0 through 9.
[[:space:]] matches any string containing whitespace (a space, tab, newline, and so on).
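
These bracketed class names are also accepted by the preg_ functions. A minimal sketch with invented sample strings:

```php
<?php
$onlyLetters = preg_match('/^[[:alpha:]]+$/', 'Hello');          // 1
$onlyAlnum   = preg_match('/^[[:alnum:]]+$/', 'abc 123');        // 0: the space is not alphanumeric
$collapsed   = preg_replace('/[[:space:]]+/', ' ', "a \t\n b");  // runs of whitespace become one space
$digitCount  = preg_match_all('/[[:digit:]]/', 'a1b22');         // counts the digits

print "$onlyLetters $onlyAlnum [$collapsed] $digitCount\n";  // prints "1 0 [a b] 3"
```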



Qualifiers restrict the number of times the preceding expression may appear.

 
The common single-character qualifiers are ?, +, and *.
?  means "zero or one."
+  means "one or more." 
*  means "zero or more."



Quantifiers for Matching a Recurring Character

 
Symbol         Description                                      Example
 
*              Zero or more instances                           a*
 
+              One or more instances                            a+
 
?              Zero or one instance                             a?
 
{n}            n instances                                      a{3}
 
{n,}           At least n instances                             a{3,}
 
{0,n}          Up to n instances                                a{0,2}
 
{n1,n2}        At least n1 instances, no more than n2 instances a{1,2}



Quantifiers: +, *, ?, {int. range}, and $ follow a character sequence:

 
p+ matches any string containing at least one p.
p* matches any string containing zero or more p's.
p? matches any string containing zero or one p.
p{2} matches any string containing a sequence of two p's.
p{2,3} matches any string containing a sequence of two or three p's.
p{2,} matches any string containing a sequence of at least two p's.
p$ matches any string with p at the end of it.
^p matches any string with p at the beginning of it.
[^a-zA-Z] matches any string containing at least one character outside the ranges a through z and A through Z.
p.p matches any string containing p, followed by any character, in turn followed by another p.
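
Several of these can be tried out with short test words. The words below are illustrative, and each comment states what preg_match() returns for that call.

```php
<?php
$dotBetween   = preg_match('/p.p/', 'pop');         // 1: p, any character, p
$nothingThere = preg_match('/p.p/', 'pp');          // 0: no character between the two p's
$runOfTwo     = preg_match('/p{2,3}/', 'apple');    // 1: "pp" is a run of two
$endsWithP    = preg_match('/p$/', 'stop');         // 1
$startsWithP  = preg_match('/^p/', 'stop');         // 0
$lettersOnly  = preg_match('/[^a-zA-Z]/', 'abc');   // 0: every character is a letter
$hasNonLetter = preg_match('/[^a-zA-Z]/', 'abc1');  // 1: "1" is not a letter

print "$dotBetween $nothingThere $runOfTwo $endsWithP $startsWithP $lettersOnly $hasNonLetter\n";
// prints "1 0 1 1 0 0 1"
```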



Ranges

 
{ specifies the beginning of a range. 
} specifies the end of a range. 
{n} specifies the preceding expression is found exactly n times. 
{n,} specifies the preceding expression is found at least n times. 
{n,m} specifies the preceding expression is found at least n but no more than m times.



Regular expressions using character classes

 
Function call                                         Result
 
preg_match("/[Ff]oo/", "Foo")                         True
 
preg_match("/[^Ff]oo/", "Foo")                        False; [^Ff] excludes both F and f
 
preg_match("/[A-Z][0-9]/", "K9")                      True
 
preg_match("/[A-S]esting/", "Testing")                False; T is outside the range A-S
 
preg_match("/[A-T]esting/", "Testing")                True; the range A-T includes T
 
preg_match("/[a-z]esting[0-9][0-9]/", "TestingAA")    False
 
preg_match("/[a-z]esting[0-9][0-9]/", "testing99")    True
 
preg_match("/[a-z]esting[0-9][0-9]/", "Testing99")    False; case sensitivity!
 
preg_match("/[a-z]esting[0-9][0-9]/i", "Testing99")   True; case problems fixed with /i
 
preg_match("/[^a-z]esting/", "Testing")               True; T is not a lowercase letter
 
preg_match("/[^a-z]esting/i", "Testing")              False; with /i the negated range also excludes T



Special classes for regular expression

 
alpha represents a letter of the alphabet (either upper- or lowercase). This is equivalent to [A-Za-z].
digit represents a digit between 0 and 9 (equivalent to [0-9]).
alnum represents an alphanumeric character, just like [0-9A-Za-z].
blank represents "blank" characters, normally space and Tab.
cntrl represents "control" characters, such as DEL, ESC, and so forth.
graph represents all the printable characters except the space.
lower represents lowercase letters of the alphabet only.
upper represents uppercase letters of the alphabet only.
print represents all printable characters.
punct represents punctuation characters such as "." or ",".
space represents whitespace characters.
xdigit represents hexadecimal digits.
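
In a pattern, each of these names is written as [:name:] inside a bracket expression. A minimal sketch with illustrative strings:

```php
<?php
$isHex      = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef');       // 1
$noPunct    = preg_replace('/[[:punct:]]/', '', 'Hi, there!');   // punctuation removed
$upperCount = preg_match_all('/[[:upper:]]/', 'PascalCase');     // counts P and C

print "$isHex [$noPunct] $upperCount\n";  // prints "1 [Hi there] 2"
```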



Validating a credit card number

 
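<?php
// Luhn checksum: returns true if the digits in $s pass the check.
function is_valid_credit_card($s) {
    // Strip non-digits, then reverse so that index 0 is the rightmost digit.
    $s = strrev(preg_replace("/[^\d]/","",$s));
    if ($s === "") {
        return false; // no digits at all
    }
    $sum = 0;
    for ($i = 0, $j = strlen($s); $i < $j; $i++) {
        if (($i % 2) == 0) {
            $val = (int) $s[$i];
        } else {
            // Double every second digit; subtract 9 if the result exceeds 9.
            $val = (int) $s[$i] * 2;
            if ($val > 9) { $val -= 9; }
        }
        $sum += $val;
    }
    return (($sum % 10) == 0);
}
// Treat a missing form field as an empty (invalid) number.
$card = isset($_POST["credit_card"]) ? $_POST["credit_card"] : "";
if (! is_valid_credit_card($card)) {
    print "Sorry, that card number is invalid.";
}
?>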



Validating Pascal Case Names

 
<?php
  $values = array(
    "PascalCase", // Valid
    "_notvalid",  // Not Valid
    );
  foreach ($values as $value) {
    if(preg_match("/^([A-Z][a-z]+)+$/", $value)) {
      printf("\"%s\" is a valid name.\n", $value);  
    } else {
      printf("\"%s\" is NOT a valid name.\n", $value);  
    }
  }
?>



Validating U.S. Currency

 
<?php 
$regex = "/^\\$?(\d{1,3}(,\d{3})*|\d+)\.\d\d$/"; 
$values = array( 
    "1,123.00", 
    "1123.00" 
); 

foreach ($values as $value) { 
    if (preg_match($regex, $value)) { 
        echo "\"" . $value . "\" is a valid number.\n"; 
    } else { 
        echo "\"" . $value . "\" is NOT a valid number.\n"; 
    } 
} 
?>