PHP/String/Regular Expressions


^ and $ are line anchors.

   <source lang="html4strict">

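^ specifies the beginning of the line. $ specifies the end of the line. In PHP's preg functions both anchor to the whole subject string unless the m (multiline) modifier is set.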

 </source>
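A minimal sketch of how the anchors behave with and without the m modifier. The sample text and variable names are made up for this illustration.

```php
<?php
// Hypothetical three-line sample text.
$text = "foo\nbar\nfoo";

// Without the m modifier, ^ and $ anchor to the whole subject string,
// so a multi-line subject is not matched by /^foo$/.
$wholeString = preg_match('/^foo$/', $text);

// With the m (multiline) modifier, ^ and $ also match around each newline.
$perLine = preg_match_all('/^foo$/m', $text, $matches);

print "Whole string matches: $wholeString\n";
print "Per-line matches: $perLine\n";
```

Here the first call finds nothing, while the second finds the two lines that consist of "foo" alone.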
   
  


\b and \B equate to "on a word boundary" and "not on a word boundary," respectively.

   <source lang="html4strict">

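<?php

   $string = "this is a test!";
   // "oo" does not occur in $string, so neither pattern matches here
   if (preg_match('/oo\b/i', $string)) {
       print "Found 'oo' at the end of a word\n";
   }
   preg_match('/oo\B/i', $string);
   // \b matches between "o" and "!" and between "y" and "-"
   preg_match('/no\b/', 'he said "no!"');
   preg_match('/royalty\b/', 'royalty-free photograph');

?>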

 </source>
   
  


Brackets [] match any single character from a set or range.

   <source lang="html4strict">

Regexp [php] finds any string containing the character p or h.
[0-9] matches any decimal digit from 0 through 9.
[a-z] matches any character from lowercase a through lowercase z.
[A-Z] matches any character from uppercase A through uppercase Z.
[A-Za-z] matches any letter, lowercase or uppercase. (A range such as [a-Z] is out of order in ASCII and is rejected by PCRE.)

 </source>
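As a short illustration of bracket ranges, the sketch below collects the digits and the uppercase letters from one string. The input string is a made-up example.

```php
<?php
// Hypothetical input used only for this illustration.
$input = "Order 66, Bay C";

preg_match_all('/[0-9]/', $input, $digits);   // every decimal digit
preg_match_all('/[A-Z]/', $input, $uppers);   // every uppercase letter

print implode(",", $digits[0]) . "\n";
print implode(",", $uppers[0]) . "\n";
```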
   
  


Character Classes

   <source lang="html4strict">

[ indicates the beginning of a character class.
- indicates a range inside a character class (unless it is first or last in the class).
^ indicates a negated character class (if found first).
] indicates the end of a character class.

 </source>
   
  


Complete list of regular expression examples

   <source lang="html4strict">

Expression Will match . . .

foo The string "foo"

^foo "foo" at the start of a line

foo$ "foo" at the end of a line

^foo$ "foo" when it is alone on a line

[Ff]oo "Foo" or "foo"

[abc] a, b, or c

[^abc] d, e, f, g, V, %, ~, 5, etc.: everything that is not a, b, or c (^ is "not" inside character classes)

[A-Z] Any uppercase letter

[a-z] Any lowercase letter

[A-Za-z] Any letter

[A-Za-z0-9] Any letter or number

[A-Z]+ One or more uppercase letters

[A-Z]* Zero or more uppercase letters

[A-Z]? Zero or one uppercase letter

[A-Z]{3} Three uppercase letters

[A-Z]{3,} A minimum of three uppercase letters

[A-Z]{1,3} One, two, or three uppercase letters

[^0-9] Any non-numeric character

[^0-9A-Za-z] Any symbol (not a number or a letter)

(cat|sat) Matches either "cat" or "sat"

([A-Z]{3}|[0-9]{4}) Matches three uppercase letters or four digits

Fo* F, Fo, Foo, Fooo, Foooo, etc.

Fo+ Fo, Foo, Fooo, Foooo, etc.

Fo? F, Fo

. Any character except \n (new line)

\b A word boundary; e.g. te\b matches the "te" in "late" but not the "te" in "tell."

\B A non-word boundary; "te\B" matches the "te" in "tell" but not the "te" in "late."

\n Newline character

\s Any whitespace (new line, space, tab, etc.)

\S Any non-whitespace character

 </source>
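A few rows of the table above can be checked directly with preg_match(). This is only a sketch: the subjects and the $rows and $failures names are chosen for this example.

```php
<?php
// Each row: pattern, subject, expected preg_match() result (1 = match, 0 = no match).
$rows = [
    ['/^foo$/',     'foo',    1],
    ['/[Ff]oo/',    'Foo',    1],
    ['/[A-Z]{3}/',  'AB',     0],
    ['/(cat|sat)/', 'he sat', 1],
    ['/te\b/',      'late',   1],
    ['/te\b/',      'tell',   0],
    ['/te\B/',      'tell',   1],
];

$failures = 0;
foreach ($rows as [$pattern, $subject, $expected]) {
    $actual = preg_match($pattern, $subject);
    if ($actual !== $expected) {
        $failures++;   // the table and the engine disagree
    }
    printf("%-14s %-8s %d\n", $pattern, $subject, $actual);
}
```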
   
  


Define a pattern and use parentheses to match individual elements within it

   <source lang="html4strict">

<?php
$test = "Whatever you do, don't panic!";
// $array[0] holds the whole match; $array[1] and $array[2] hold the two groups
if ( preg_match( "/(don't)\s+(panic)/", $test, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}
?>

 </source>
   
  


Greedy and non-greedy matching

   <source lang="html4strict">

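<?php
$meats = "<b>Chicken</b>, <b>Beef</b>, <b>Duck</b>";
// Non-greedy: .*? stops at the first </b>, giving one match per tag pair
preg_match_all("@<b>.*?</b>@",$meats,$matches);
foreach ($matches[0] as $meat) {
   print "Meat A: $meat\n";
}
// Greedy: .* runs to the last </b>, giving a single match
preg_match_all("@<b>.*</b>@",$meats,$matches);
foreach ($matches[0] as $meat) {
   print "Meat B: $meat\n";
}
?>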

 </source>
   
  


Greedy Qualifiers

   <source lang="html4strict">

Qualifier What It Matches

* The preceding expression can be found any number of times, including zero.

+ The preceding expression can be found one or more times.

? The preceding expression can be found at most once.

 </source>
   
  


Greedy versus nongreedy matching

   <source lang="html4strict">

<?php
$html = "love you .";
// Greedy
$matchCount = preg_match_all("@.+@", $html, $matches);
print "Greedy count: " . $matchCount . "\n";
// Nongreedy
$matchCount = preg_match_all("@.+?@", $html, $matches);
print "First non-greedy count: " . $matchCount . "\n";
// Nongreedy: the U modifier inverts the greediness of quantifiers
$matchCount = preg_match_all("@.+@U", $html, $matches);
print "Second non-greedy count: " . $matchCount . "\n";
?>

 </source>
   
  


Grouping captured subpatterns

   <source lang="html4strict">

<?php
// One key=value pair per line; (.*) stops at the end of each line
$todo = "\nfirst=a\nnext=B\nlast=C\n";
preg_match_all("/([a-zA-Z]+)=(.*)/", $todo, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
   print "The {$match[1]} action is {$match[2]} \n";
}
?>

 </source>
   
  



Match an IP address

   <source lang="html4strict">

<?php
$test = "156.152.55.35";
// Each (\d+) group captures one dot-separated part of the address
if ( preg_match( "/(\d+)\.(\d+)\.(\d+)\.(\d+)/", $test, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}
?>

 </source>
   
  


Matching a Valid E-mail Address

   <source lang="html4strict">

<?php
// Single-quoted so that $ and \ reach the regex engine unchanged;
// the hyphen sits last in each character class so it is taken literally
$regex = '/^[\w\d!#$%&\'*+\/=?^`{|}~-]+(\.[\w\d!#$%&\'*+\/=?^`{|}~-]+)*@([a-z\d][-a-z\d]*[a-z\d]\.)+[a-z][-a-z\d]*[a-z]$/';
$values = array( "user@example.ru", "user@example" );
foreach ($values as $value) {
   if (preg_match($regex, $value)) { 
       printf("Found valid address: %s\n", $value); 
   } else { 
       printf("INVALID address: %s\n", $value); 
   } 
}
?>

 </source>
   
  


Matching a Valid IP Address

   <source lang="html4strict">

<?php
$good_ip = "192.168.0.1";
$bad_ip = "1.334.10.10";
// Each octet must be 0-255; the separating dot is escaped so it matches literally
$regex = '/^(([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([1-9]?[0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$/';
// preg_match() replaces ereg(), which was removed in PHP 7
if (preg_match($regex, $good_ip)) {
   echo "'" . $good_ip . "' is a valid ip address.\n"; 
} else {
   echo "'" . $good_ip . "' is an INVALID ip address.\n"; 
}
if (preg_match($regex, $bad_ip)) {
   echo "'" . $bad_ip . "' is a valid ip address.\n"; 
} else {
   echo "'" . $bad_ip . "' is an INVALID ip address.\n"; 
}
?>

 </source>
   
  


Matching GUIDs/UUIDs

   <source lang="html4strict">

<?php
$uuid = "1111-1111-1111-1111";
function printResults($str) {
   // preg_match() with the i modifier replaces eregi(), which was removed in PHP 7
   if (preg_match('/^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$/i', $str)) {
      printf("'%s' is a valid GUID/UUID.\n", $str); 
   } else {
      printf("'%s' is NOT a valid GUID/UUID.\n", $str); 
   }
}
printResults($uuid);
?>

 </source>
   
  


Matching using backreferences

   <source lang="html4strict">

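<?php
$ok_html = "I <b>love</b> shrimp dumplings.";
// Without a backreference, a mismatched pair such as <b>...</i> is also accepted
if (preg_match("@<[bi]>.*?</[bi]>@",$ok_html)) {
   print "Good for you! (OK, No backreferences)\n";
}
// \\1 requires the closing tag to repeat whatever ([bi]) captured
if (preg_match("@<([bi])>.*?</\\1>@",$ok_html)) {
   print "Good for you! (OK, Backreferences)\n";
}
?>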

 </source>
   
  


Matching with |

   <source lang="html4strict">

<?php
$text = "The files are c.gif, r.pdf, and e.jpg.";
// (gif|jpe?g) accepts the extensions gif, jpg, and jpeg
if (preg_match_all("/[a-zA-Z0-9]+\.(gif|jpe?g)/",$text,$matches)) {
   print "The image files are: " . implode(",",$matches[0]);
}
?>

 </source>
   
  


Matching with character classes and anchors

   <source lang="html4strict">

<?php
$thisFileContents = file_get_contents(__FILE__);
// Single quotes keep \$ intact, so the regex matches a literal dollar sign
$matchCount = preg_match_all('/\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/',$thisFileContents, $matches);
print "Matches: $matchCount\n";
foreach ($matches[0] as $variableName) {
   print "$variableName\n";
}
?>

 </source>
   
  


Matching with Greedy vs. Nongreedy Expressions

   <source lang="html4strict">

<?php
$teststring = '"Hello" and "Goodbye."';
// Greedy: .* runs from the first quote to the last one
$greedyresult = preg_replace('/".*"/', '"***"', $teststring);
// Nongreedy: .*? stops at the nearest closing quote
$nongreedyresult = preg_replace('/".*?"/', '"***"', $teststring);
echo "Original: $teststring\n";
echo "Greedy Replace: $greedyresult\n";
echo "Nongreedy Replace: $nongreedyresult\n";
?>

 </source>
   
  


Match the smallest number of characters starting with "p" and ending with "t"

   <source lang="html4strict">

<?php
$text = "pot post pat patent";
// .*? is non-greedy, so the match ends at the first "t": "pot"
if ( preg_match( "/p.*?t/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}
?>

 </source>
   
  


Match URL

   <source lang="html4strict">

<?php

 $hostRegex = "([a-z\d][-a-z\d]*[a-z\d]\.)*[a-z][-a-z\d]*[a-z]";
 $portRegex = "(:\d{1,})?";
 $pathRegex = "(\/[^\s?]+)?";
 $queryRegex = "(\?[^<>#\"\s]+)?";
 $urlRegex = "/(?:(?<=^)|(?<=\s))((ht|f)tps?:\/\/" . $hostRegex . $portRegex . $pathRegex . $queryRegex . ")/";
 $str = "This is my homepage:  http://home.example.ru.";
 $str2 = "This is my homepage:  http://home.example.ru:8181/index.php";
 $sample1 = preg_replace($urlRegex, "<a href=\"\\1\">\\1</a>", $str);
 $sample2 = preg_replace($urlRegex, "<a href=\"\\1\">\\1</a>", $str2);
 echo $sample1 . "\n";
 echo $sample2 . "\n";

?>

 </source>
   
  


Nongreedy Qualifiers

   <source lang="html4strict">

Qualifier What It Matches

*? The preceding expression can be found any number of times, but the matching will stop as soon as it can.

+? The preceding expression can be found one or more times, but the matching will stop as soon as it can.

 </source>
   
  


Option patterns:

   <source lang="html4strict">

(pattern) = Groups the pattern to act as one item and captures it


   (x|y) = Matches either pattern x, or pattern y
   
   
   
   [abc] = Matches either the character a, b, or c
   
   
   
   [^abc] = Matches any character except a, b, or c
   
   
   
   [a-f] = Matches characters a through f
 
 </source>
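The option patterns above can be tried out as follows. This is a sketch with made-up sample strings; the variable names are not from the original.

```php
<?php
// Hypothetical sample strings for this sketch.
preg_match('/(c|s)at/', 'my cat sat', $alt);   // alternation inside a capturing group
preg_match('/[a-f]+/', 'xyz face', $range);    // run of characters in the range a-f
preg_match('/[^abc]/', 'abcd', $negated);      // first character that is not a, b, or c

print "{$alt[0]} (group 1: {$alt[1]})\n";
print "{$range[0]}\n";
print "{$negated[0]}\n";
```

Index 0 of each result array holds the whole match, and index 1 holds what the first pair of parentheses captured.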
   
  


Pattern matches:

   <source lang="html4strict">

\d = Digit


   \D = Not a digit
   
   
   
   \s = Whitespace
   
   
   
   \S = Not whitespace
   
   
   
   . = Any character (except \n)
   
   
   
   ^ = Start of string
   
   
   
   $ = End of string
   
   
   
   \b = Word boundary
 
 </source>
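A brief sketch of these pattern matches in use. The sample sentence and the variable names are assumptions made for illustration.

```php
<?php
$line = "Room 101 is open";   // hypothetical sample

preg_match('/\d+/', $line, $digits);              // first run of digits
$words = preg_split('/\s+/', $line);              // split on whitespace
$startsWithRoom = preg_match('/^Room/', $line);   // anchored at the start
$endsWithDigit  = preg_match('/\d$/', $line);     // anchored at the end
$hasWordIs      = preg_match('/\bis\b/', $line);  // "is" as a whole word

print $digits[0] . "\n";
print implode("|", $words) . "\n";
```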
   
  


Pattern match extenders:

   <source lang="html4strict">

? = Previous item is matched 0 or 1 times.


   * = Previous item is matched 0 or more times.
   
   
   
   + = Previous item is matched 1 or more times.
   
   
   
   {n} = Previous item is matched exactly n times.
   
   
   
   {n,} = Previous item is matched at least n times.
   
   
   
   {n,m} = Previous item is matched at least n and at most m times.
   
   
   
   ? (after any of above) = Match as few times as possible.
 
 </source>
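To see the extenders at work, the following sketch runs counted quantifiers against a run of four a's. The subject string is a made-up example.

```php
<?php
$subject = "aaaa";  // hypothetical sample

preg_match('/a{2,3}/', $subject, $greedy);   // takes as many as allowed
preg_match('/a{2,3}?/', $subject, $lazy);    // trailing ? takes as few as allowed
$exactlyThree = preg_match('/^a{3}$/', $subject);
$atLeastThree = preg_match('/^a{3,}$/', $subject);

print "Greedy: {$greedy[0]}\n";
print "Lazy: {$lazy[0]}\n";
```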
   
  


Perl-Compatible Regular Expressions (PCRE)

   <source lang="html4strict">

\w represents a "word" character and is equivalent to the expression [A-Za-z0-9_].

   \W represents the opposite of \w and is equivalent to [^A-Za-z0-9_].
   
   \s represents a whitespace character.
   
   \S represents a nonwhitespace character.
   
   \d represents a digit and is equivalent to [0-9].
   
   \D represents a nondigit character and is equivalent to [^0-9].
   
   \n represents a newline character.
   
   \r represents a return character.
   
   \t represents a tab character.
 
 </source>
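The sketch below shows that \w also covers the underscore, and that \t can be used to split tab-separated text. The identifier and the field names are invented for the example.

```php
<?php
$name = "snake_case_1";  // hypothetical identifier

$allWordChars  = preg_match('/^\w+$/', $name);           // underscore counts as a word character
$lettersDigits = preg_match('/^[A-Za-z0-9]+$/', $name);  // underscore is not in this class
$fields = preg_split('/\t/', "id\tname\temail");         // split on tab characters

print "All word characters: $allWordChars\n";
print "Letters and digits only: $lettersDigits\n";
print count($fields) . " fields\n";
```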
   
  


POSIX Regular Expressions Character Classes

   <source lang="html4strict">

Expression Meaning

[[:alpha:]] A letter, such as A-Z or a-z

[[:digit:]] A digit, 0-9

[[:space:]] Whitespace, such as a tab or space character

\< or \> Word boundaries

 </source>
   
  


Predefined Character Ranges (Character Classes)

   <source lang="html4strict">

[[:alpha:]] matches any string containing alphabetic characters aA through zZ.
[[:digit:]] matches any string containing numerical digits 0 through 9.
[[:alnum:]] matches any string containing alphanumeric characters aA through zZ and 0 through 9.
[[:space:]] matches any string containing a space.

 </source>
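PCRE also accepts the POSIX class names inside brackets, so they can be used with preg_match(). A minimal sketch with made-up subjects:

```php
<?php
// POSIX class names are written inside an enclosing bracket expression.
$onlyLetters = preg_match('/^[[:alpha:]]+$/', 'Hello');
$onlyDigits  = preg_match('/^[[:digit:]]+$/', '2024');
$onlyAlnum   = preg_match('/^[[:alnum:]]+$/', 'abc 123');  // the space is not alphanumeric
$hasSpace    = preg_match('/[[:space:]]/', 'abc 123');

print "$onlyLetters $onlyDigits $onlyAlnum $hasSpace\n";
```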
   
  


Qualifiers restrict the number of times the preceding expression may appear.

   <source lang="html4strict">

The common single-character qualifiers are ?, +, and *. ? means "zero or one," + means "one or more," and * means "zero or more."
 </source>
   
  


Quantifiers for Matching a Recurring Character

   <source lang="html4strict">

Symbol Description Example

* Zero or more instances a*

+ One or more instances a+

? Zero or one instance a?

{n} n instances a{3}

{n,} At least n instances a{3,}

{0,n} Up to n instances a{0,2} (older PCRE versions treat {,n} as literal text)

{n1,n2} At least n1 instances, no more than n2 instances a{1,2}

 </source>
   
  


Quantifiers: +, *, ?, {int. range}, and $ follow a character sequence:

   <source lang="html4strict">

p+ matches any string containing at least one p.
p* matches any string containing zero or more p's.
p? matches any string containing zero or one p.
p{2} matches any string containing a sequence of two p's.
p{2,3} matches any string containing a sequence of two or three p's.
p{2,} matches any string containing a sequence of at least two p's.
p$ matches any string with p at the end of it.
^p matches any string with p at the beginning of it.
[^a-zA-Z] matches any string containing a character other than a through z and A through Z.
p.p matches any string containing p, followed by any character, in turn followed by another p.

 </source>
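Several of these descriptions can be checked with a small helper. The function name matches() and the sample words are hypothetical and chosen only for this sketch.

```php
<?php
// Hypothetical helper: true when $pattern matches somewhere in $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('/p{2}/', 'apple'));   // sequence of two p's
var_dump(matches('/p{2}/', 'pear'));    // only a single p
var_dump(matches('/^p/', 'pear'));      // p at the beginning
var_dump(matches('/p$/', 'pear'));      // p at the end
var_dump(matches('/p.p/', 'pop'));      // p, any character, p
var_dump(matches('/^p?$/', 'pp'));      // p? allows at most one p
```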
   
  


Ranges

   <source lang="html4strict">

{ specifies the beginning of a range.
} specifies the end of a range.
{n} specifies the preceding expression is found exactly n times.
{n,} specifies the preceding expression is found at least n times.
{n,m} specifies the preceding expression is found at least n but no more than m times.

 </source>
   
  


Regular expressions using character classes

   <source lang="html4strict">

Function call Result

preg_match("/[Ff]oo/", "Foo") True

preg_match("/[^Ff]oo/", "Foo") False;

preg_match("/[A-Z][0-9]/", "K9") True

preg_match("/[A-S]esting/", "Testing") False;

preg_match("/[A-T]esting/", "Testing") True;

preg_match("/[a-z]esting[0-9][0-9]/", "TestingAA") False

preg_match("/[a-z]esting[0-9][0-9]/", "testing99") True

preg_match("/[a-z]esting[0-9][0-9]/", "Testing99") False; case sensitivity!

preg_match("/[a-z]esting[0-9][0-9]/i", "Testing99") True; case problems fixed with /i

preg_match("/[^a-z]esting/", "Testing") True;

preg_match("/[^a-z]esting/i", "Testing") False;

 </source>
   
  


Special classes for regular expression

   <source lang="html4strict">

alpha represents a letter of the alphabet (either upper- or lowercase). This is equivalent to [A-Za-z].

   digit represents a digit between 0 and 9 (equivalent to [0-9]).
   
   alnum represents an alphanumeric character, just like [0-9A-Za-z].
   
   blank represents "blank" characters, normally space and Tab.
   
   cntrl represents "control" characters, such as DEL, INS, and so forth.
   
   graph represents all the printable characters except the space.
   
   lower represents lowercase letters of the alphabet only.
   
   upper represents uppercase letters of the alphabet only.
   
   print represents all printable characters.
   
   punct represents punctuation characters such as "." or ",".
   
   space is the whitespace.
   
   xdigit represents hexadecimal digits.
 
 </source>
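In PHP's preg functions these class names are written as [[:name:]]. The sketch below tries a few of them; the sample strings are assumptions made for illustration.

```php
<?php
$isHex      = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef');  // hexadecimal digits only
$notHex     = preg_match('/^[[:xdigit:]]+$/', 'xyz');
$upperCount = preg_match_all('/[[:upper:]]/', 'Hello World', $upper);
$noPunct    = preg_replace('/[[:punct:]]/', '', 'Hi, there.');  // strip punctuation

print "$isHex $notHex $upperCount\n";
print "$noPunct\n";
```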
   
  


Validating a credit card number

   <source lang="html4strict">

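<?php
function is_valid_credit_card($s) {
   // Strip non-digits and reverse, so index 0 is the rightmost digit
   $s = strrev(preg_replace("/[^\d]/","",$s));
   if ($s === "") { return false; }   // no digits at all
   $sum = 0;
   for ($i = 0, $j = strlen($s); $i < $j; $i++) {
       if (($i % 2) == 0) {
           $val = (int) $s[$i];
       } else {
           // Luhn check: double every second digit and fold two-digit results
           $val = (int) $s[$i] * 2;
           if ($val > 9) { $val -= 9; }
       }
       $sum += $val;
   }
   return (($sum % 10) == 0);
}
if (! is_valid_credit_card($_POST["credit_card"] ?? "")) {
   print "Sorry, that card number is invalid.";
}
?>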

 </source>
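The same Luhn check can be exercised outside a form handler. In this sketch luhn_check() is a hypothetical name, and 4111 1111 1111 1111 is used as a commonly cited test number rather than a real card.

```php
<?php
// luhn_check() is a hypothetical name used for this sketch.
function luhn_check(string $input): bool
{
    // Keep digits only, then reverse so index 0 is the rightmost digit.
    $digits = strrev(preg_replace('/\D/', '', $input));
    if ($digits === '') {
        return false;  // nothing to validate
    }
    $sum = 0;
    for ($i = 0, $n = strlen($digits); $i < $n; $i++) {
        $val = (int) $digits[$i];
        if ($i % 2 === 1) {          // every second digit from the right
            $val *= 2;
            if ($val > 9) {
                $val -= 9;           // same as adding the two digits of the product
            }
        }
        $sum += $val;
    }
    return $sum % 10 === 0;
}

var_dump(luhn_check('4111 1111 1111 1111'));  // passes the checksum
var_dump(luhn_check('4111 1111 1111 1112'));  // last digit changed, so it fails
```

A passing checksum only shows that the digits are internally consistent; it says nothing about whether the card exists.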
   
  


Validating Pascal Case Names

   <source lang="html4strict">

<?php

 $values = array(
   "PascalCase", // Valid
   "_notvalid",  // Not Valid
   );
 foreach ($values as $value) {
   if(preg_match("/^([A-Z][a-z]+)+$/", $value)) {
     printf("'%s' is a valid name.\n", $value);  
   } else {
     printf("'%s' is NOT a valid name.\n", $value);  
   }
 }

?>

 </source>
   
  


Validating U.S. Currency

   <source lang="html4strict">

<?php $regex = "/^\\$?(\d{1,3}(,\d{3})*|\d+)\.\d\d$/"; $values = array(

   "1,123.00", 
   "1123.00" 

);

foreach ($values as $value) {

   if (preg_match($regex, $value)) { 
       echo "'" . $value . "' is a valid number.\n"; 
   } else { 
       echo "'" . $value . "' is NOT a valid number.\n"; 
   } 

} ?>

 </source>