Regular expressions in PHP

Lecture



Regular expressions.

Regular expressions are a powerful, flexible tool for parsing text according to a specific pattern.

A pattern (template) is a string of ordinary characters, special characters and modifiers that describes the rules the parsed text must match.


Regular expression syntax.

This section describes what patterns consist of.
Every pattern must be enclosed in delimiters. Any non-alphanumeric, non-whitespace character except '\' can serve as a delimiter.
Using characters that are special inside a pattern as delimiters is not recommended, because using them inside the pattern then becomes inconvenient. The / character is a good choice because it has no special meaning of its own.
If the delimiter character occurs inside the pattern, it must be escaped.
Example: '/pattern/i' - matches a string containing the word pattern. i is a modifier; looking ahead, it means a case-insensitive comparison.
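
The delimiter rules can be tried out with a short sketch. The sample strings are made up for illustration, and the use of '#' as an alternative delimiter and of preg_quote() are additions that the text does not mention.

```php
<?php
// Same pattern, two delimiter choices. With '/' the slash inside the
// pattern must be escaped; with '#' it can be written as-is.
$subject = 'path: qwe/rty';

$withSlash = preg_match('/qwe\/rty/', $subject);   // delimiter is '/'
$withHash  = preg_match('#qwe/rty#', $subject);    // delimiter is '#'

// The i modifier after the closing delimiter makes the match case-insensitive.
$caseInsensitive = preg_match('/pattern/i', 'A PATTERN here');

// preg_quote() escapes special characters; its second argument names
// the delimiter so that it is escaped as well.
$quoted = preg_quote('a/b.c', '/');

var_dump($withSlash, $withHash, $caseInsensitive, $quoted);
```

Choosing a delimiter that does not occur in the pattern avoids the escaping altogether, which tends to matter most for patterns that contain paths or URLs.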

1 - Special characters.
They come first because they make the rest easier to describe.
\ - escape character.
Example: '/qwe\/rty/' matches a string that contains qwe/rty. The / character is escaped, so at this position it loses its special meaning (it was the pattern delimiter).
^ - start of data.
$ - end of data.
Example: '/^pattern$/' - matches a string that is exactly the word pattern, i.e. the string begins with the letter p and ends after n.
. - any character except a line feed. There is a modifier that makes the line feed count as "any character" as well.
Example: '/pat.ern/' - matches a string containing pattern, or patdern, or pat3ern ...
[] - these brackets list characters, any one of which may stand at this position. This is called a character class. Special characters inside [] behave a little differently; this is described below.
Example: '/pat[aoe]rn/' - matches only strings containing patarn, patorn or patern.
| - or (alternation). An example follows below.
() - subpattern.
? - zero or one occurrence of the preceding character or subpattern.
* - any number of occurrences of the preceding character or subpattern, including zero.
+ - one or more occurrences.
Example: '/as+(es|du)?.*r/' - the letter a, then one or more letters s, then the combination es or du once or not at all, then any number of any characters, and the letter r.
The ? character has one more meaning. By default the asterisk (like the other quantifiers) is greedy. This means that in our example the part '.*r' runs to the last r: in the string asdrfsrsfdr the pattern matches the whole string, even though two more letters r occur before the last one. Greed can be turned off, so that the pattern matches only the substring asdr, up to the first r. To do this, place the modifier (?U) in the pattern before the point where greed should be disabled. A ? placed directly after a quantifier, as in '.*?', makes only that quantifier non-greedy. This is one more use of the ? and () characters.
{a,b} - the number of occurrences of the preceding character or subpattern, from a to b. If b is omitted, there is no upper limit. For example, * is the same as {0,}, ? is the same as {0,1}, and {5,7} means 5, 6 or 7 repetitions.
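
The difference between greedy and non-greedy matching is easiest to see by running the example pattern. The following is a minimal sketch; the variable names and the digit strings used for {5,7} are illustrative assumptions.

```php
<?php
$subject = 'asdrfsrsfdr';

// Greedy (default): .* runs as far as possible, so the match ends at the last r.
preg_match('/as+(es|du)?.*r/', $subject, $greedy);

// (?U) inverts greediness for everything after it: the match stops at the first r.
preg_match('/as+(es|du)?(?U).*r/', $subject, $ungreedy);

// A ? after a single quantifier makes only that quantifier non-greedy.
preg_match('/as+(es|du)?.*?r/', $subject, $lazy);

// {5,7} allows 5, 6 or 7 repetitions of the preceding item.
$sixDigits   = preg_match('/^\d{5,7}$/', '123456');
$eightDigits = preg_match('/^\d{5,7}$/', '12345678');

echo $greedy[0], "\n", $ungreedy[0], "\n", $lazy[0], "\n";
```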

a) Special characters inside a character class.
^ - negation.
Example: [^da] - matches any character except d and a.
Example: [^^] - matches any character except ^.
Example: [d^a] - matches any of the three listed characters. [\^da] is the same.
In [d^a] the ^ character is not at the beginning of the list, so it loses its special meaning, and escaping it there is not necessary.
- - inside a character class denotes a character range.
Example: [0-9a-e] - matches any character from 0 to 9 and from a to e. To list the hyphen itself in a character class, either escape it or place it directly before ].
The \ character must be used carefully in a character class. If it is placed before ], the bracket is escaped and no longer closes the class. Any character that can be escaped is escaped by it as well. To include a backslash itself, write \\.
The $ character is an ordinary character inside a character class, and so are parentheses.
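
A short sketch of these character-class rules follows. The sample strings are invented for illustration only.

```php
<?php
// [^da]: any character except d and a; removing those leaves only d and a.
$onlyDa = preg_replace('/[^da]/', '', 'dracula');

// [d^a]: ^ is an ordinary character when it is not first in the class.
$caret = preg_match('/[d^a]/', '2^3');

// [0-9a-e]: two ranges in one class.
preg_match_all('/[0-9a-e]/', 'x9-cz', $ranges);

// A hyphen directly before ] is literal; $ and parentheses are ordinary here too.
preg_match_all('/[$()-]/', 'f($x)-1', $literals);

print_r([$onlyDa, $caret, $ranges[0], $literals[0]]);
```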

b) The \ character. One of its functions is to remove the special meaning from special characters. The other, on the contrary, is to give special meaning to ordinary characters.
\cx - Ctrl+x. Any character can stand in place of x.
\e - escape.
\f - form feed (page break).
\n, \r, \t - the usual line feed, carriage return and tab.
\d - any decimal digit.
\D - any character that is not a decimal digit.
\s - any whitespace character.
\S - any character that is not whitespace.
\w - any "word" character: a digit, a letter or an underscore.
\W - any character that does not match \w.
\b - word boundary: the position between \w and \W, between \W and \w, between the start of the data and \w, or between \w and the end of the data.
\B - not a word boundary.
The last two constructs do not match any actual characters; they match a position.
\xHH - the character with hexadecimal code HH. The x is literally the letter x.
\DDD - the character with octal code DDD, or a reference to a subpattern (a backreference).
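
To make the escape sequences concrete, here is a small sketch. The sample text and variable names are assumptions chosen for the demonstration.

```php
<?php
$text = "Order 66 was sent\tat 10:45";

preg_match_all('/\d+/', $text, $numbers);          // runs of decimal digits
$parts     = preg_split('/\s+/', $text);           // split on any whitespace
$wordCount = preg_match_all('/\b\w+\b/', $text);   // count runs of word characters

// \b matches a position, not a character: "cat" is found only as a whole word.
$whole  = preg_match('/\bcat\b/', 'a cat sat');
$inside = preg_match('/\bcat\b/', 'concatenate');

// \x41 is the character with hexadecimal code 41, the letter A.
$hex = preg_match('/^\x41$/', 'A');

print_r($numbers[0]);
print_r($parts);
```

The patterns are written in single quotes so that PHP passes the backslashes to the regular expression engine unchanged.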

About references to a subpattern: '/([0-9]{2,3}).*\1/' - 2 or 3 characters from 0 to 9, then any sequence of characters, and then the same 2 or 3 characters that the subpattern matched. So the string 'as34sdf34' matches: 34 occurs in both places. The string 'sd34dg32' does not.
If the parser finds \x, it reads as many of the following characters as can form a hexadecimal number, but no more than two. If there are three, two are taken; if there are fewer, it takes as many as there are.
If the parser finds \0, it does the same, only it reads octal digits rather than hexadecimal ones, again at most two. So \0\x\0325 means two characters with code zero, a character with octal code 32, and the digit 5.
If the backslash is followed by a nonzero digit, things are more complicated. Take \40. If the pattern contains at least 40 subpatterns, it is interpreted as a reference to the 40th subpattern (fortieth in decimal). If there are fewer subpatterns, it is treated as the character with octal code 40.
\040 is always the character with octal code 40.
\7 is always a reference to a subpattern.
\13 - depends on the situation.
In a character class, character ranges can be specified with codes: [\044-\056]
It is also worth noting that a subpattern reference number cannot be greater than 99.
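
The backreference and octal-code rules can be checked with a few calls. This is a sketch only; the test strings for \040 and for the [\044-\056] range are illustrative assumptions.

```php
<?php
$pattern = '/([0-9]{2,3}).*\1/';

$same = preg_match($pattern, 'as34sdf34', $m);   // 34 ... 34
$diff = preg_match($pattern, 'sd34dg32');        // 34 ... 32

// \040 is always octal 40 (decimal 32), which is the space character.
$space = preg_match('/^a\040b$/', 'a b');

// Codes can also bound a range in a character class:
// \044 is "$" and \056 is ".", so the class covers $ % & ' ( ) * + , - .
$inRange = preg_match('/^[\044-\056]+$/', '$%&.');

var_dump($same, $m[1], $diff, $space, $inRange);
```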

2 - Ordinary characters.
These are all characters that are not special.

3 - Modifiers.
They are specified either in parentheses inside the pattern, for example (?Ui), or after the closing delimiter: '/pattern/Ui'.
i - case-insensitive matching.
U - inverts greed.
m - multiline search: ^ and $ also match at the start and end of each line.
s - if set, the . character also matches a line feed. Otherwise it does not.
x - ignores all unescaped whitespace in the pattern unless it is inside a character class. This is convenient when you want to lay out a pattern with line breaks and spaces so that it is easy to read.
Inside the parentheses, a '-' sign disables the modifiers that follow it. (?m-i) turns on multiline search and turns off case-insensitivity.
All modifiers switch something on, or off if given after a minus. U is the exception in that it inverts: if quantifiers were greedy, they become non-greedy, and a ? after a quantifier then makes it greedy again.
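
Each modifier can be observed in isolation. The sketch below uses made-up subject strings; only the modifiers themselves come from the list above.

```php
<?php
$i = preg_match('/pattern/i', 'PaTtErN');             // case-insensitive

// Without s the dot stops at a line feed; with s it does not.
$dotPlain = preg_match('/a.b/',  "a\nb");
$dotAll   = preg_match('/a.b/s', "a\nb");

// m: ^ and $ also match at the start and end of every line.
$lines = preg_match_all('/^\w+$/m', "one\ntwo\nthree");

// x: whitespace in the pattern is ignored, so it can be laid out for reading.
$x = preg_match('/ \d{2} - \d{2} /x', '12-34');

// Inline form: (?i) switches the option on, (?-i) switches it off again.
$inlineOn  = preg_match('/(?i)abc/', 'ABC');
$inlineOff = preg_match('/(?i)a(?-i)bc/', 'ABC');

var_dump($i, $dotPlain, $dotAll, $lines, $x, $inlineOn, $inlineOff);
```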

4 - Assertions.
Assertions are checks on the characters before or after the current matching position. For example, \b is the assertion that the previous character is a word character and the next one is not, or vice versa. That is a built-in assertion; now we will learn to write our own.
Assertions about the following text (lookahead) begin with (?= for positive assertions and (?! for negative ones.
Assertions about the preceding text (lookbehind) begin with (?<= for positive assertions and (?<! for negative ones.
For example, '/(?<!foo)bar/' finds only an occurrence of "bar" that is not preceded by "foo". So the pattern ignores qwefoobar but matches asacdbar.
(?<=\d{3})(?<!999)foo matches the substring "foo" preceded by three digits other than "999". Note that each of the assertions is checked at the same position in the text.
Assertions can be nested in arbitrary combinations: (?<=(?<!foo)bar)baz matches the substring "baz" preceded by "bar", which in turn is not preceded by "foo".
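
The assertion examples can be run directly. In this sketch the lookahead pair foo(?=bar) and foo(?!bar) and all subject strings other than qwefoobar and asacdbar are illustrative additions.

```php
<?php
// Negative lookbehind: "bar" that is not preceded by "foo".
$afterFoo = preg_match('/(?<!foo)bar/', 'qwefoobar');
$noFoo    = preg_match('/(?<!foo)bar/', 'asacdbar');

// Two lookbehinds checked at the same position.
$pattern = '/(?<=\d{3})(?<!999)foo/';
$ok      = preg_match($pattern, '123foo');
$blocked = preg_match($pattern, '999foo');

// Lookahead checks the following text without including it in the match.
$followed    = preg_match('/foo(?=bar)/', 'foobar', $m);
$notFollowed = preg_match('/foo(?!bar)/', 'foobar');

// Nested assertions: "baz" preceded by "bar" that is not preceded by "foo".
$nestedOk = preg_match('/(?<=(?<!foo)bar)baz/', 'xxxbarbaz');
$nestedNo = preg_match('/(?<=(?<!foo)bar)baz/', 'foobarbaz');

var_dump($afterFoo, $noFoo, $ok, $blocked, $m[0], $notFollowed, $nestedOk, $nestedNo);
```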

a) Conditional subpatterns.
The general form says it all: (?(condition)yes-pattern|no-pattern)
Example: (?(?=\d)u|p). Here (?=\d) is the condition: we assert that a digit follows this position. If that is true, the letter u must stand at this position; otherwise, p. Since u is not a digit, the yes branch can never succeed here, so the example only illustrates the syntax.
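
A conditional subpattern is perhaps easier to follow with a condition that can go both ways. The sketch below is an illustration, not part of the lecture: the (?(1)...) form, which tests whether subpattern 1 took part in the match, is not described in the text and is included here as an assumption about what readers may find useful.

```php
<?php
// (?(1)\)) reads: if group 1 (the opening parenthesis) took part in the
// match, require a closing parenthesis; otherwise require nothing.
$pattern = '/^(\()?[a-z]+(?(1)\))$/';

$balanced   = preg_match($pattern, '(abc)');
$plain      = preg_match($pattern, 'abc');
$unbalanced = preg_match($pattern, '(abc');

// Condition written as a lookahead: if a digit comes next,
// match digits only, otherwise match lowercase letters only.
$lookahead = '/^(?(?=\d)\d+|[a-z]+)$/';
$digits  = preg_match($lookahead, '123');
$letters = preg_match($lookahead, 'abc');
$mixed   = preg_match($lookahead, '12a');

var_dump($balanced, $plain, $unbalanced, $digits, $letters, $mixed);
```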

5 - Comments.
A comment begins with (?# and continues to the nearest closing parenthesis. It works like /* */ in PHP: comments cannot be nested.
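
A brief sketch of comments inside a pattern follows. The second form, where # starts a comment under the x modifier, is not mentioned in the text and is shown here as an addition; the date-like strings are made up.

```php
<?php
// Inline comments: everything from (?# to the next ) is ignored.
$inline = preg_match('/\d{4}(?# year)-\d{2}(?# month)/', '2024-05');

// With the x modifier, an unescaped # outside a character class also
// starts a comment that runs to the end of the line.
$pattern = '/
    \d{4}   # year
    -
    \d{2}   # month
/x';
$extended = preg_match($pattern, '2024-05');

var_dump($inline, $extended);
```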

That's all. This much theory should, in principle, be enough.


PHP functions for working with regular expressions.

mixed preg_match (string $pattern, string $subject [, array &$matches [, int $flags [, int $offset]]])
Searches the subject text for a match with pattern. It returns 1 if a match is found, 0 if not, and false if an error occurs.
If the optional matches parameter is given, it is filled with the search results. $matches[0] contains the part of the string that matched the whole pattern, $matches[1] the part that matched the first subpattern, and so on. $pattern is the pattern, $subject is where to search. There are a couple of examples in the manual.

The preg_match_all function is similar and takes the same parameters. It finds all matches, whereas preg_match finds only the first.
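
The two functions can be compared side by side. This is a minimal sketch with an invented subject string and date-like pattern.

```php
<?php
$subject = 'born 12-05-1987, moved 3-11-2004';
$pattern = '/(\d{1,2})-(\d{1,2})-(\d{4})/';

// preg_match stops at the first match and returns 1 (or 0 if there is none).
// $matches[0] is the whole match, $matches[1] to $matches[3] are the subpatterns.
$found = preg_match($pattern, $subject, $matches);

// preg_match_all returns the number of matches. By default the result is
// grouped by subpattern, so $all[3] holds every year that was found.
$count = preg_match_all($pattern, $subject, $all);

// No match gives 0, which is different from false (an error).
$none = preg_match('/xyz/', $subject);

print_r($matches);
print_r($all[3]);
```

Because 0 and false are both falsy, comparing the result with === is the safer way to tell "no match" from "error".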

array preg_split (string $pattern, string $subject [, int $limit [, int $flags]])
Returns an array of the substrings of subject, split along the boundaries matched by pattern.
If the limit parameter is given, the function returns no more than limit substrings. The special limit value -1 means no limit.
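
A short sketch of preg_split with and without a limit follows. The input strings are made up, and the PREG_SPLIT_NO_EMPTY flag is a real flag that the text above does not discuss.

```php
<?php
$csv = 'one, two;three ,four';

// Split on a comma or semicolon with optional surrounding whitespace.
$all = preg_split('/\s*[,;]\s*/', $csv);

// With limit 2 the rest of the string is returned as the last element.
$two = preg_split('/\s*[,;]\s*/', $csv, 2);

// PREG_SPLIT_NO_EMPTY drops empty pieces; -1 means "no limit".
$words = preg_split('/\W+/', '...a, b!', -1, PREG_SPLIT_NO_EMPTY);

print_r($all);
print_r($two);
print_r($words);
```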

mixed preg_replace (mixed $pattern, mixed $replacement, mixed $subject [, int $limit])
Searches the subject string for matches with pattern and replaces them with replacement. If the limit parameter is given, at most limit occurrences of the pattern are replaced; if limit is omitted or equals -1, all occurrences are replaced.
$replacement may contain references to the subpatterns of the pattern, written as \n or $n. This makes it possible, for example, to swap the parts of the string that matched two different subpatterns.
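
Swapping two subpatterns, as described, might look like the sketch below. The names and strings are illustrative assumptions.

```php
<?php
// Swap the two subpatterns: "last, first" becomes "first last".
$swapped = preg_replace('/(\w+), (\w+)/', '$2 $1', 'Doe, John');

// limit = 1 replaces only the first occurrence.
$once = preg_replace('/\d/', '#', 'a1b2c3', 1);

// Arrays of patterns and replacements are applied one after another,
// each to the result of the previous replacement.
$multi = preg_replace(['/a/', '/b/'], ['b', 'c'], 'ab');

var_dump($swapped, $once, $multi);
```

The replacement string is single-quoted so that PHP does not try to interpolate $1 and $2 as variables.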

mixed preg_replace_callback (mixed $pattern, callback $callback, mixed $subject [, int $limit])
Performs a regular expression search and replaces each match with the value returned by the callback function. Example:

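<?php
// Callback: receives the match array and returns the replacement string.
function rnd_replace($matches)
{
    // Non-digit characters that compare greater than 'c' get a random digit attached.
    if ($matches[1] > 'c') {
        return '(' . $matches[1] . '->' . rand(0, 9) . ')';
    }
    // Everything else is returned unchanged.
    return $matches[1];
}

$src = 'sd4vaf345g534fgh43kj3';
// (\D) captures every single non-digit character.
$res = preg_replace_callback('/(\D)/', 'rnd_replace', $src);
echo $res;
?>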

Every non-digit character greater than 'c' is replaced with that character paired with a random digit; run the script and see the result for yourself.
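
Because the example above uses rand(), its output changes on every run. A deterministic variant, given here only as a sketch with an anonymous function and a made-up subject string, makes the callback mechanism easier to verify.

```php
<?php
// Double every number found in the string. The callback receives the
// match array and must return the replacement as a string.
$doubled = preg_replace_callback(
    '/\d+/',
    function (array $m): string {
        return (string) ((int) $m[0] * 2);
    },
    '3 apples and 12 pears'
);

echo $doubled, "\n";
```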

Tasks for the section

1. You have some PHP code in which string array indices are not enclosed in quotes. Enclose them in quotes. Note that variables and function calls can also be array indices in the code; they must not be quoted. Declared constants need not be taken into account.

2. Given a string. Check if all the characters in it are unique.

3. Check the syntactic correctness of a string containing an e-mail address.

4. Check the syntactic correctness of a date. The date format is 'dd-mm-yyyy'. A day or month less than 10 may be written with one digit. It would also be good to check the number of days in the month. Leap years need not be considered.

5. Find all the links on the page.

There are also special class libraries that simplify working with regular expressions.

