|
Document Title |
Programming Introduction to Regular Expressions |
|
Type |
Reference Guide |
|
Categories |
Programming |
|
Level |
Advanced |
|
Keywords |
linux, programming, shell, script, regular, expression, expression, regexp, regex, perl, php, bash, mod_rewrite |
|
Author |
Stewart Watkiss www.watkissonline.co.uk |
|
Date Created |
9th September 2009 |
|
Last Updated |
9th September 2009 |
|
Copyright |
© 2006 - 2009 www.penguintutor.com |
Regular expressions are a way of manipulating text by using character strings and metacharacters. They are often used by programmers, but can also be used in shell commands and in some programs (e.g. vi and other text editors). The examples that follow are mainly based upon the regular expression format used by perl. Some programs / languages may use different syntax or meta-characters but the general concept is the same.
Regular expressions can initially be compared with wild-characters that you may already be familiar with, but regular expressions are much more powerful. For example most people are familiar with the ? and * when used on the command line to mean: match any single character, and match any number of characters, respectively.
To match a using a regular expression
the expression is first contained within forward slashes. For
example:/Linux/
would match on the word Linux,
which could occur anywhere in the string being tested.
Regular Expressions are often used to
find and replace text so the example:s/Linux/UNIX/
would
replace the word Linux with UNIX (useful if you wanted to convert
some documentation to a more generic UNIX reference rather than
restricting to Linux. The "s" means substitute, in the
match earlier there is an implied "m" character meaning
match. Depending upon the program / language being used you may or
may not need to include the "s" or "m"
characters.
There are also a number of modifiers that can be added to the end of the regular expression. The two most common are:
g which means match globally, ie. replace all occurrences in the string rather than stopping after the first one (which is probably what we really wanted for the ealier example).
i means ignore case (or perform a case-insensitive match).
So our earlier example could become:s/Linux/UNIX/gi
to
replace for all occurrences regardless of case.
The above examples are nothing more than a basic find and replace, but the real power of regular expressions is when you include meta-characters and other special commands. The metacharacters are:
|
\ |
Escape the character following it (ie. ignore special properties of the following character) |
|---|---|
|
. |
Match any single character except new line |
|
^ |
Match at the beginning of the string, or if the first character in square brackets matches elements not in the list |
|
$ |
Match at the end of the string |
|
* |
Match the preceding element 0 or more times |
|
? |
Match the preceding element 0 or 1 time |
|
{...} |
Range of occurrences for the preceding element |
|
[...] |
Match any of the characters between the brackets |
|
(...) |
Groups regular expressions often leaves values in $x - where x is the number of the bracket pair (staring at 1) |
|
| |
Matches either the preceding or following character/expression |
There are also some escape sequences / character classes that can be used:
|
\n |
Newline |
|---|---|
|
\r |
Carriage return |
|
\t |
Tab |
|
\f |
Formfeed |
|
\d |
Match a digit e.g. [0-9] |
|
\d |
Match a non-digit e.g. [^0-9] |
|
\w |
Match a word character (alphanumeric) e.g. [a-zA-Z_0-9] |
|
\W |
Match a non-word character e.g. [^a-zA-Z_0-9] |
|
\s |
Match a whitespace character e.g. [ \t\n\r\f] |
|
\S |
Match a non-whitespace character e.g. [^ \t\n\r\f] |
When using square brackets, some characters can have different meanings depending on whether they are inside our outside the brackets. A period '.' outside the brackets will match on any character, but inside will only match on a period character.
/[df]og/ Matches dog or
fog
/^The/ Matches "The"
if it is at the beginning of the string
/^The end$/ Matches "The
end" if that is the entire string
/The\s*end/ Matches "The
end" and "The end"
but not "The - end"
/The.*end/ Matches “The
end” as well as "The book will soon be coming to an end."
Basic shell programming reference guide
Programming perl regular expression quick reference guide