Search - regular expressions

Spider lets you use regular expressions in the Extended Search and Replace tool and in the text search tool.

Regular expressions are special character strings that act as patterns: they let you check whether the text you are searching matches a given pattern. Special metacharacters let you specify, for example, that the searched string must occur at the beginning or end of a line, or must contain a certain number of repetitions of selected characters.

Regular expressions look complicated to beginners, but they are usually quite simple, handy and very useful.

Simple match

Each character matches itself unless it is a metacharacter with a special meaning, as described below.

A sequence of characters matches the same sequence in the target string, so for example the pattern lubudu matches the string lubudu in the target string.

You can make characters that normally act as metacharacters or escape sequences be interpreted as ordinary characters by prefixing them with a backslash \. For example, the metacharacter ^ matches the beginning of a line, but \^ matches a literal ^ character, \\ matches a literal \ character, and so on.

Examples:

lubudu       matches lubudu
\^LuBuDu     matches ^LuBuDu
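
The escaping rule above can be tried out in most regular expression engines. Below is a minimal sketch in Python; it assumes that Python's re module behaves like the program's engine for these basic patterns, and the sample text is made up for the illustration.

```python
import re

text = "say ^LuBuDu, then lubudu"

# A plain literal matches itself (the search is case-sensitive).
plain = re.search(r"lubudu", text)

# \^ matches a literal caret instead of anchoring to the line start.
escaped = re.search(r"\^LuBuDu", text)

# Unescaped, ^ anchors to the start of the text, so nothing is found here.
anchored = re.search(r"^LuBuDu", text)

print(plain.group(), escaped.group(), anchored)
```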

Escape sequences

Individual characters can be written as escape sequences, with a syntax similar to that of C or Perl. For example, \n stands for a newline, \t for a TAB character, and so on. A more general construct is \xnn, where nn is a string of hexadecimal digits; it matches the character whose ASCII code is nn. For Unicode characters, you can use \x{nnnn}, where nnnn is one or more hexadecimal digits.

\xnn         ASCII character nn
\x{nnnn}     character with the code nnnn (one byte for ASCII text, two bytes for Unicode characters)
\t           TAB character (HT / TAB), same as \x09
\n           new line (NL), same as \x0a
\r           carriage return (CR), same as \x0d
\f           form feed (FF), same as \x0c
\a           alarm (BEL), same as \x07
\e           escape (ESC), same as \x1b

Examples:

lubudu\x20trach     matches the string lubudu trach (with a space character in between)
\tlubudu            matches the string lubudu preceded by a TAB character
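
The two examples above can be checked with a short Python sketch. This is only an approximation of the program's engine: Python's re module accepts \xnn and \t, but not the \x{nnnn} or \e forms, so those are left out. The sample strings are invented.

```python
import re

# \x20 is the space character (hex code 20).
with_space = re.search(r"lubudu\x20trach", "xx lubudu trach xx")

# \t requires a TAB character right before lubudu.
with_tab = re.search(r"\tlubudu", "a\tlubudu")
no_tab = re.search(r"\tlubudu", "a lubudu")

print(with_space.group())
print(with_tab is not None, no_tab is not None)
```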

Character classes

You can specify a character class by enclosing a list of characters in square brackets []. The class matches any one character from the bracketed list.

If the first character after [ is ^, the class is negated: it matches any character that is not in the list.

Examples:

lubu[cdf]u      finds the strings lubucu, lubudu, lubufu, but not lubuau, lububu, lubueu, etc.
lubu[^cdf]u     finds strings such as lubuau, lububu, lubueu, etc., but not lubucu, lubudu, lubufu

Within the list, you can use the - character to specify a range of characters, e.g. a-z stands for all characters from a to z inclusive.

If you want the - character itself to be a member of the class, place it at the beginning or end of the list, or precede it with a backslash. If you want the ] character to be a member of the class, place it at the beginning of the list or precede it with a backslash.

Examples:

[-az]         matches the characters a, z and -
[az-]         matches the characters a, z and -
[a\-z]        matches the characters a, z and -
[a-z]         matches all characters from a to z
[\n-\x0D]     matches any of the characters with codes #10, #11, #12, #13
[\d-t]        matches any digit, the character - or t
[]-a]         matches any character from ] to a
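
The class rules above can be illustrated with a small Python sketch. It assumes Python's re module treats these simple classes the same way as the program's engine; the word list and variable names are made up.

```python
import re

words = ["lubucu", "lubudu", "lubufu", "lubuau", "lububu"]

# [cdf] accepts exactly one of c, d or f in that position.
in_class = [w for w in words if re.fullmatch(r"lubu[cdf]u", w)]

# [^cdf] accepts any single character except c, d and f.
negated = [w for w in words if re.fullmatch(r"lubu[^cdf]u", w)]

# A leading hyphen is a literal member of the class, not a range.
literal_hyphen = re.findall(r"[-az]", "a-m-z")

# a-z is a range, so m is included and the hyphens are skipped.
letter_range = re.findall(r"[a-z]", "a-m-z")

print(in_class, negated, literal_hyphen, letter_range)
```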

Metacharacters

Metacharacters are special characters that are the essence of regular expressions. There are different types of metacharacters, described below.

Metacharacters - line separators

^      beginning of line
$      end of line
\A     beginning of text
\Z     end of text
.      any character in a line

Examples:

^lubudu      matches the string lubudu only if it is at the beginning of a line
lubudu$      matches the string lubudu only if it is at the end of a line
^lubudu$     matches the string lubudu only if it is the only string on the line
lubu.u       matches the strings lubuau, lububu, lubucu and so on

By default, the ^ metacharacter is only guaranteed to match at the beginning of the input string, and $ at the end. Internal line separators are therefore not matched by ^ or $.

However, you may wish to treat the string as multi-line, so that ^ matches after any line separator and $ matches before any line separator. You can do this with the /m modifier.

The metacharacters \A and \Z work like ^ and $, except that they still match only at the beginning and end of the whole text when the /m modifier is used, whereas ^ and $ then match at every internal line separator.

The . metacharacter matches any character by default, but if you switch off the /s modifier, . will not match line separator characters.
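
The effect of the /m and /s modifiers can be sketched in Python, where the closest counterparts are assumed to be the re.MULTILINE and re.DOTALL flags. Note one difference: in Python the dot skips newlines unless re.DOTALL is set, whereas the text above describes /s as being on by default. The sample text is invented.

```python
import re

text = "lubudu one\ntwo lubudu\nlubudu"

# Without the multi-line flag, ^ matches only at the start of the text.
default_starts = [m.start() for m in re.finditer(r"^lubudu", text)]

# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = [m.start() for m in re.finditer(r"^lubudu", text, re.MULTILINE)]

# \A stays pinned to the start of the text even in multi-line mode.
pinned_starts = [m.start() for m in re.finditer(r"\Alubudu", text, re.MULTILINE)]

# The dot crosses a newline only when re.DOTALL is set.
dot_default = re.search(r"one.two", text)
dot_all = re.search(r"one.two", text, re.DOTALL)

print(default_starts, multiline_starts, pinned_starts)
```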

Metacharacters - predefined classes

\w     alphanumeric character (including _)
\W     non-alphanumeric character
\d     digit
\D     non-digit
\s     any whitespace character (same as [ \t\n\r\f])
\S     any non-whitespace character

You can use \w, \d, and \s inside your own character classes.

Examples:

lubu\du         matches the strings lubu1u, lubu6u and so on, but does not match the strings lubudu, lubucu, etc.
lubu[\w\s]u     matches the strings lubudu, lubu u, lubuuu and also lubu1u (digits are alphanumeric), but does not match strings such as lubu=u or lubu-u
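
A short Python sketch of the predefined classes follows. It assumes Python's \d, \w and \s cover the same characters as the program's classes for plain ASCII text; the candidate strings are made up.

```python
import re

candidates = ["lubu1u", "lubu6u", "lubudu", "lubu u", "lubu=u"]

# \d accepts a single digit between lubu and u.
digit_hits = [s for s in candidates if re.fullmatch(r"lubu\du", s)]

# \w covers letters, digits and _, and \s covers whitespace,
# so only the '=' variant is rejected.
word_or_space_hits = [s for s in candidates if re.fullmatch(r"lubu[\w\s]u", s)]

print(digit_hits)
print(word_or_space_hits)
```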

Metacharacters - iterators

Any element of a regular expression can be followed by another kind of metacharacter, called an iterator. Iterators let you specify how many times the preceding character, metacharacter or subexpression must repeat in the string.

*          zero or more, same as {0,} (greedy)
+          one or more, same as {1,} (greedy)
?          zero or one, same as {0,1} (greedy)
{n}        exactly n times (greedy)
{n,}       at least n times (greedy)
{n,m}      at least n times, but not more than m times (greedy)
*?         zero or more, same as {0,}? (not greedy)
+?         one or more, same as {1,}? (not greedy)
??         zero or one, same as {0,1}? (not greedy)
{n}?       exactly n times (not greedy)
{n,}?      at least n times (not greedy)
{n,m}?     at least n times, but not more than m times (not greedy)

As you can see, the numbers in braces of the form {n,m} specify the minimum (n) and maximum (m) number of repetitions required for a match. The {n} form is equivalent to {n,n} and matches exactly n repetitions. The {n,} form matches n or more repetitions. There is no limit on the size of n or m, but large values can consume more memory and slow down matching.

If the curly bracket occurs in another context, it is treated as a regular character.

Examples:

lubu.*u         matches strings such as lubuau, lubuajhkjsd33j8dsu and lubuu
lubu.+u         matches strings such as lubuau and lubuajhkjsd33j8dsu, but not lubuu
lubu.?u         matches strings such as lubuau, lububu and lubuu, but not lubualkj9u
lubua{2}u       matches lubuaau
lubua{2,}u      matches strings such as lubuaau, lubuaaau, lubuaaaau, etc.
lubua{2,3}u     matches strings such as lubuaau or lubuaaau, but not lubuaaaau

The terms greedy and not greedy used in the list of iterators need some explanation. In short, a greedy iterator takes as many characters as possible, and a non-greedy one as few as possible. For example, applied to abbbbc, b+ returns bbbb and b+? returns b; b{2,3} returns bbb and b{2,3}? returns bb. Starting from the first b, b* also takes bbbb, while b*? settles for an empty string.

All iterators can be put into non-greedy mode using the /g modifier (by switching it off, as described under Modifiers).
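
The difference between greedy and non-greedy iterators can be seen in a short Python sketch. Python has no /g modifier, so non-greedy mode is chosen per iterator with the trailing ?; otherwise the behaviour is assumed to match the program's engine. The HTML snippet is made up.

```python
import re

text = "abbbbc"

greedy_plus = re.search(r"b+", text).group()       # as many b's as possible
lazy_plus = re.search(r"b+?", text).group()        # as few b's as possible
greedy_range = re.search(r"b{2,3}", text).group()
lazy_range = re.search(r"b{2,3}?", text).group()

# A greedy .* runs to the last closing tag, a non-greedy .*? stops at the first.
html = "<b>one</b> and <b>two</b>"
greedy_tag = re.search(r"<b>.*</b>", html).group()
lazy_tag = re.search(r"<b>.*?</b>", html).group()

print(greedy_plus, lazy_plus, greedy_range, lazy_range)
print(lazy_tag)
```

The last two lines show why non-greedy iterators matter in practice: when searching markup, the greedy form swallows everything between the first opening tag and the last closing tag.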

Metacharacters - Alternatives

You can specify a series of alternatives for a pattern by separating them with |, so that e.g. lut|lup|luk will match any of lut, lup or luk in the target string (lu(t|p|k) will do the same). The first alternative includes everything from the last pattern delimiter ( ( , [ or the beginning of the pattern) up to the first |, and the last alternative includes everything from the last | to the next pattern delimiter. For this reason, it is common practice to put alternatives in parentheses to minimize confusion about where they start and end.

Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen. This means that alternatives are not necessarily greedy. For example, when foo|foot is matched against barefoot, only the foo part will match, because it is the first alternative tried and it successfully matches the target string.

Also note that | in square brackets is interpreted as a normal character, so if you write [luk|lup|lut] you are actually looking for the expression [lukpt|].

Example: luk(asz|iem) matches two strings: lukasz or lukiem.
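
The left-to-right rule for alternatives can be checked with a short Python sketch, assuming Python's re module orders alternatives the same way as the program's engine. The sample strings are made up.

```python
import re

# The first alternative that lets the pattern succeed wins, even if a
# later alternative would match more text.
first_wins = re.search(r"foo|foot", "barefoot").group()
longer_first = re.search(r"foot|foo", "barefoot").group()

# Parentheses limit the alternatives to the suffix after luk.
grouped = [m.group() for m in re.finditer(r"luk(asz|iem)", "lukasz z lukiem")]

# Inside square brackets | is an ordinary character.
class_hits = re.findall(r"[luk|lup|lut]", "a|k")

print(first_wins, longer_first, grouped, class_hits)
```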

Metacharacters - subexpressions

The parenthesis construction ( ... ) can also be used to define regular subexpressions. Subexpressions are numbered in order from left to right depending on the opening parentheses. The first subexpression is numbered 1 (the entire result of the regular expression is numbered 0).

Examples:

(lubudu){8,10}      matches strings that contain 8, 9 or 10 consecutive occurrences of the word lubudu
lubu([0-9]|d+)u     matches the strings lubu0u, lubu1u, lubudu, lubuddu, etc.
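
The numbering of subexpressions can be inspected with a small Python sketch. It assumes Python's group numbering (0 for the whole match, 1 and up for the parentheses) corresponds to the numbering described above; the variable names are made up.

```python
import re

match = re.search(r"lubu([0-9]|d+)u", "xx lubuddu xx")

whole = match.group(0)   # the entire match
first = match.group(1)   # text captured by the first parenthesis

# Nested groups are numbered by the position of their opening parenthesis.
nested = re.search(r"((lu)(bu))du", "lubudu")
numbering = nested.groups()

print(whole, first, numbering)
```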

Metacharacters - backward references

Metacharacters \1 through \9 are interpreted as backward references. \n matches the text previously matched by subexpression number n.

Examples:

(.)\1+               matches aaaa and cc
(.+)\1+              also matches abab and 123123
(['"]?)(\d+)\1       matches "13" (in quotation marks), '4' (in apostrophes) or 77 (without quotation marks or apostrophes), etc.
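
Backward references can be tried out with the following Python sketch, which assumes that \1 works in Python's re module as it does in the program's engine. The sample strings are invented.

```python
import re

# \1 must repeat exactly what the first group captured.
doubled = [m.group() for m in re.finditer(r"(.)\1+", "aaaa b cc")]
repeated = [m.group() for m in re.finditer(r"(.+)\1+", "abab 123123")]

# The closing quote must be the same character as the opening one (or absent).
quoted = re.compile(r"""(['"]?)(\d+)\1""")
numbers = [m.group(2) for m in quoted.finditer("\"13\" '4' 77")]

print(doubled, repeated, numbers)
```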

Modifiers

Modifiers allow you to change the behavior of the regular expression search function.

There are many ways to set modifiers. Each of the modifiers can be included in a regular expression using the (?...) construct.

i     Performs a case-insensitive pattern search.
m     Treats the string as multi-line, i.e. changes ^ and $ from matching only at the beginning or end of the whole string to matching at the beginning or end of any line within the string.
s     Treats the string as a single line, i.e. changes . to match any character, including line separators, which it would not normally match.
g     Not a standard modifier. Switching it off makes all subsequent iterators work in non-greedy mode, so with /g off, + works like +?, * like *? and so on.
x     Extends the pattern syntax by allowing whitespace and comments (explained below).

The /x modifier needs some explanation. It tells the program to ignore whitespace that is neither preceded by a backslash nor inside a character class. You can use this modifier to break regular expressions into more readable parts. The # character is then also treated as a metacharacter that introduces a comment, e.g.:

(
(abc) # comment 1
| # You can use spaces to format the regexp
(efg) # comment 2
)

This means that if you want to include whitespace or the # character in a pattern (except in a character class, where the /x modifier does not apply to them), you must either escape them with a backslash or encode them using their hexadecimal or octal character codes. In summary, these properties allow you to make the regular expression more readable.
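
Python offers a comparable extended mode through the re.VERBOSE flag, which is assumed here to be a reasonable stand-in for the /x modifier. The sketch below reuses the commented pattern and adds a literal # and a literal space to show the two ways of keeping them in the pattern; the trailing "done" text is made up.

```python
import re

pattern = re.compile(
    r"""
    (
        (abc)   # comment 1
      |         # whitespace here only formats the pattern
        (efg)   # comment 2
    )
    \#\x20done  # a literal hash is escaped, a literal space is hex-encoded
    """,
    re.VERBOSE,
)

hit = pattern.search("efg# done")
miss = pattern.search("efg #done")

print(hit.group(), miss)
```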

Extensions from Perl

(?imsxr-imsxr)

This construct lets you switch modifiers on or off on the fly inside a regular expression. If it is placed inside a subexpression, it affects only that subexpression.

Examples:

(?i)New-year           matches the New-year and New-Year sequences
(?i)New-(?-i)Year      matches the New-Year sequence, but does not match the New-year sequence
(?i)(New-)?Year        matches the New-year and new-year sequences
((?i)New-)?Year        matches the new-Year sequence, but does not match the new-year sequence
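
Inline modifiers also exist in Python, but with a slightly different syntax, so the sketch below is an approximation rather than a copy of the patterns above: Python accepts a bare (?i) only at the start of the pattern and scopes any later change with the (?i:...) and (?-i:...) group forms (Python 3.6 or newer is assumed). The variable names are made up.

```python
import re

# (?i) at the start makes the whole pattern case-insensitive.
whole_insensitive = re.compile(r"(?i)New-year")

# (?-i:...) switches case-insensitivity off for the wrapped part only.
year_sensitive = re.compile(r"(?i)New-(?-i:Year)")

# (?i:...) limits case-insensitivity to the wrapped part.
prefix_insensitive = re.compile(r"(?i:New-)?Year")

print(bool(whole_insensitive.fullmatch("New-Year")))
print(bool(year_sensitive.fullmatch("new-Year")), bool(year_sensitive.fullmatch("New-year")))
print(bool(prefix_insensitive.fullmatch("new-Year")), bool(prefix_insensitive.fullmatch("new-year")))
```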

(?#text)

A comment whose text is ignored. Note that the program closes the comment as soon as it encounters the metacharacter ), so there is no way to include the character ) in the comment.

Using "Replace with" in regular expression search results

You will often want to reuse the matched text in the Replace with field. To do so, place the appropriate symbol in the replacement text, e.g. $1, $2 or $0, where $1 is the fragment matched by the first subexpression, $2 the fragment matched by the second, and so on, and $0 is the entire matched phrase.

Example: let us assume that many documents contain, among other things, the following text:

<a href="gallery_first.html">First gallery</a>
<a href="gallery_second.html">Second gallery</a>
<a href="gallery_third.html">Third gallery</a>
<a href="https://guestbook.com/index.html">Guest book</a>

You have decided to rewrite the whole site in PHP, so all the links need to be fixed. At first glance, the easiest way would be to simply change the string .html to .php everywhere. However, this is not a good idea, because the extension in the guest book link would also be changed. Instead, use the regular expression capabilities of the Extended Search and Replace tool.

In the 'Find' field, enter:

gallery_([a-z0-9]+){1}\.html

This finds every string that consists of 'gallery_', followed by a run of lowercase letters or digits (the parentheses turn the run into a subexpression, and {1} requires it to occur exactly once), and finally the .html extension. The dot is preceded by a backslash so that it matches a literal dot rather than any character.

In the 'Replace with' field, enter:

gallery_$1.php

This means that each match is replaced with 'gallery_', followed by the letters or digits captured by the subexpression ('first', 'second' and 'third' respectively), and the '.php' extension. The guest book link will of course remain unchanged.

The result will be the following content:

<a href="gallery_first.php">First gallery</a>
<a href="gallery_second.php">Second gallery</a>
<a href="gallery_third.php">Third gallery</a>
<a href="https://guestbook.com/index.html">Guest book</a>
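
The same kind of replacement can be rehearsed outside the program with a short Python sketch. It is only an analogy: Python's re.sub writes the first captured fragment as \g<1> rather than $1, and the two-line sample markup is made up for the illustration.

```python
import re

html = (
    '<a href="gallery_first.html">First gallery</a>\n'
    '<a href="https://guestbook.com/index.html">Guest book</a>\n'
)

# \g<1> inserts the text captured by the first subexpression,
# like $1 in the Replace with field.
updated = re.sub(r"gallery_([a-z0-9]+)\.html", r"gallery_\g<1>.php", html)

print(updated)
```

Only the link that starts with gallery_ is rewritten; the guest book link does not fit the pattern and is left as it was.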
