Regular Expressions (Part 1) - Basic Syntax

Table of Contents

1. Regular Expressions Basics
2. Creating Your Own Patterns
3. How Metacharacters Work
4. Quantifier Greediness
5. Pattern Modifiers
6. PCRE vs. POSIX
7. Putting it All Together
8. Conclusion and Future Tutorials

Regular Expressions Basics

What are Regular Expressions?
Regular expressions (referred to as "regexes" from here on) are a way of describing patterns to match inside of text. They use a special syntax to find and obtain information from a string. Many programming languages support regexes in some form because they are so useful. Not only can a pattern be used to validate that a certain sequence exists in a string, but it can also be used to extract the matched portions and make them usable in your PHP code. Extraction isn't covered in this tutorial, but keep it in mind as you're reading, because it's the main focus of the next tutorial in the series.

How are they different from regular string searching functions?
Regular string searching functions like strpos() and str_replace() are very limited in what they can actually do. They're great for replacing a single character or two, or changing 'ham' to 'turkey' in a recipe, but that's about all they can do. If you wanted to see whether a string contained only digits (and possibly commas to separate the numbers), you'd have to create some convoluted processing routine along the lines of:

<?php
$string = "731,489,392,222";
$bad = false;
foreach (str_split($string) as $char) {
	// flag any character that is neither a digit nor a comma
	if (!is_numeric($char) && $char != ',') {
		$bad = true;
	}
}
if (!$bad) {
	// ....
}
?>

That's where the power of regexes comes into play. They have amazing capabilities in terms of analyzing strings for certain patterns and matches.
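
For comparison, here is a minimal sketch of the same check written as a single pattern. It borrows syntax that is explained later in this tutorial (anchors, a character class, and a quantifier) and runs the pattern with preg_match(), which is covered properly in the next tutorial. The variable name $onlyDigitsAndCommas is just for illustration.

```php
<?php
$string = "731,489,392,222";

// ^ and $ tie the pattern to the whole string; [0-9,]+ means
// "one or more characters, each of which is a digit or a comma".
$onlyDigitsAndCommas = preg_match('/^[0-9,]+$/', $string) === 1;

var_dump($onlyDigitsAndCommas); // bool(true)
```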

Why bother? I can make parsing routines that do the job just fine.
This question is actually quite commonly asked by first-time regex users. Regular expressions have quite a learning curve, and they have a very intricate syntax, but they are one of the most powerful ways to get PHP to handle data precisely as you want. Sometimes, the kinds of data and patterns that regular expressions can parse are simply not possible to emulate with regular string handling functions. Regexes can be quite daunting at first, but when they're broken up into individual parts and their syntax is analyzed, they're actually quite simple.

Well, now it's time for you to learn how to create regular expressions of your very own.

Note: This tutorial is just covering regular expression syntax. Actual pattern matching, substitution, and handling of results in PHP will be covered in subsequent tutorials. Also, this tutorial set covers Perl-Compatible Regular Expressions (PCRE), for reasons which will be discussed later.

Creating Your Own Patterns

Before we start interpreting regular expression syntax, it would be a good idea to first see what a pattern looks like. A pattern must ALWAYS have an opening delimiter and an ending delimiter to "enclose" it. The delimiters tell the engine where the pattern stops, and they separate the pattern from its modifiers, which will be discussed later on in the tutorial. The most commonly seen delimiter is /, but it's often advisable to use some obscure character that will never end up in one of your patterns (such as `, #, or !). For most of the examples in this tutorial, I'll be using / as my delimiter, simply because it's conventional, but you can use any character that isn't a letter, a digit, a backslash, or whitespace. Now that we have delimiters out of the way, let's look at one of the simplest patterns.

/abc/

When a plain letter is shown in a regular expression, it's interpreted as just that -- a plain letter. This pattern would match ANY string containing 'abc' in that order. Not very useful, but it illustrates some basic principles. Notice how the part you want matched ('abc') is contained within the delimiters? That's how every pattern has to be formatted, otherwise it simply won't work.

More Than Plain Letters
Being able to throw some plain letters inside of / / is great, but you could just use your trusty strpos() for that. Where the real power of regular expressions comes in is with metacharacters. These characters don't match themselves, but actually "misbehave" a bit. Some misbehave so much that they actually cause other characters to misbehave. The metacharacters in regular expressions are:

\ | . ( ) [ ] { } ^ $ * + ?

In order to actually match these characters literally, you'd need to add a backslash before them. So, by that statement, in order to match a literal ., you could use the following pattern:

/\./

Notice once again the delimiters (/ /), and the backslash which comes before the metacharacter to "de-meta" it. One interesting point: in order to match a literal backslash, you'd need to put 2 backslashes, since the backslash itself is a metacharacter.

/\\/
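
A quick sketch of escaping in practice, assuming the patterns are run with preg_match() (covered in the next tutorial). One PHP-specific wrinkle is worth showing: the backslash is also special inside PHP string literals, so a pattern that needs a literal backslash ends up with four of them in the source code. The sample strings are made up for illustration.

```php
<?php
// In a single-quoted PHP string, \. reaches the regex engine unchanged.
$literalDot = preg_match('/\./', 'example.com'); // 1: the string contains a dot
$noDot      = preg_match('/\./', 'example com'); // 0: there is no dot to match

// PHP collapses \\ to \ inside string literals, so the regex \\ (one literal
// backslash) is written with four backslashes in PHP source.
$literalBackslash = preg_match('/\\\\/', 'C:\\temp'); // 1

// preg_quote() escapes the metacharacters in a string for you; the second
// argument names the delimiter so that it gets escaped as well.
$escaped = preg_quote('1+1=2?', '/');
$quotedMatches = preg_match('/^' . $escaped . '$/', '1+1=2?'); // 1
```

preg_quote() is handy whenever part of a pattern comes from a variable rather than being typed out by hand.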

How Metacharacters Work

This section will be completely devoted to an in-depth explanation of each of the metacharacters, and how they act when used in a pattern.

The Catch-All Metacharacter (.)
The . is actually a pretty simple concept. The . metacharacter matches ANY character except for a newline (\n), but that can be changed with the s modifier (covered in the Pattern Modifiers section). A simple demonstration of the . at work would be in a pattern like:

/c.t/

This pattern would match cat, cut, cbt, cet, czt, c4t, and so on, but not caat because there's only one dot (.) in the pattern. It's important to note that . only matches ONE character, until quantifiers are introduced.
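
As a small sketch, the list above can be checked with preg_grep(), which returns the elements of an array that match a pattern. The word list is just sample data.

```php
<?php
$words = ['cat', 'cut', 'c4t', 'caat', 'ct'];

// Keep only the words that contain c, exactly one other character, then t.
$matched = array_values(preg_grep('/c.t/', $words));

print_r($matched); // cat, cut and c4t; 'caat' and 'ct' do not fit
```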

The Anchoring Metacharacters (^ $)
Sometimes you'll need to actually make sure that a string begins or ends with a certain character. That's where ^ and $ come in.
The ^ character matches the beginning point of a string. It doesn't actually match a literal character, but instead, it matches a boundary. To make sure that a string starts with a Z, you could apply the following pattern:

/^Z/

The $ character is essentially the same as the ^ character, except it matches the end of a string. To make sure that a string ended with a g, you'd use:

/g$/

Note how the $ came after the g, unlike in the pattern with ^. This is because the pattern is literally telling the regex engine "a g followed by the end of the string boundary", not the other way around. In fact, the other order would make little sense, because you can't have characters after the end of a string. The only way that characters can appear after a $ is if your regular expression is in multi-line mode. It will be discussed in depth later on in this tutorial, but keep in mind that the meaning of ^ and $ can change sometimes.

These two metacharacters can also be used simultaneously in a pattern, and in fact, that's often how they are used. It would probably help to show an example:

/^abc$/

This pattern, after being taken apart, is quite simple. It is saying that a string must begin, then match 'abc', then match the end. Basically, the string must be 'abc' in order to match.
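
Here is a short sketch of the three anchored patterns from this section, run against a handful of made-up strings with preg_grep() (the sample strings and variable names are only for illustration).

```php
<?php
$tests = ['abc', 'xabc', 'abcx', 'Zebra', 'song'];

$startsWithZ = array_values(preg_grep('/^Z/', $tests));    // only 'Zebra'
$endsWithG   = array_values(preg_grep('/g$/', $tests));    // only 'song'
$exactlyAbc  = array_values(preg_grep('/^abc$/', $tests)); // only 'abc'

print_r([$startsWithZ, $endsWithG, $exactlyAbc]);
```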

The Grouping Metacharacters (( ))
The ( and ) characters are both used in a concept called grouping. The ( character begins a group, and the ) character ends it. Every ( must have a closing ).
This comes into play not only in capturing things to be used later, but it also is the only way to make other metacharacters operate on more than one character. Normally, each character is its own little subpattern on which other metacharacters can operate, but the parentheses change that. They allow more 'complex' subpatterns to be made. Any reference to the word subpattern from now on means a single character subpattern or a parenthesized group.

Any examples shown right now would be rather pointless, but here's one:

/(abc)/

Right now, this would match the same as /abc/, but that's going to change after the explanation of other metacharacters.
Another use of grouping is to create backreferences, which actually allow you to match things based on what previous groups captured. This comes into play in substitution and more advanced syntax, but keep it in mind.

The Quantifying Metacharacters ( * + ? { } )
The quantifiers are among the most important regex metacharacters. They dictate how often a certain letter (or group!) can or must appear. They certainly make patterns more dynamic than just matching single characters.

The * quantifier says that the preceding subpattern must appear 0 or more times, which basically means that it can appear, and if it does, it doesn't matter how many there are. It's often used to account for random whitespace in a string, but it has other uses as well. For now, let's look at a simple pattern:

/ab*c/

This pattern would match ac, abc, abbc, abbbc, etc. If you want to have * operate on more than one character, you'd need to use those grouping metacharacters that were mentioned earlier (I told you they'd come in handy!):

/a(bcd)*/

This would match a, abcd, abcdbcd, abcdbcdbcd, and so on.

The + quantifier operates just like the *, except it dictates that the preceding subpattern must appear one or more times. It tells the Engine that a certain subpattern must appear, and if it does, it can repeat indefinitely.

/c.+t/

Here, you can see some of the real power of quantifiers. They can be used with any character, including metacharacters like the dot, in this case. This pattern would match cat, caaaaat, cbbbajsduasuut, cjkallskt, etc.

The ? quantifier makes the preceding subpattern optional, meaning the preceding group can appear zero or one times.

/a(bcd)?e/

This would match either abcde or ae, because the ? makes the (bcd) subpattern optional.

The { and } metacharacters are used to specify even more exact quantities for subpatterns. They have several different syntax options to accomplish different things, and they are as follows:

/a(bcd){2}e/ # matches abcdbcde because {2} specifies EXACTLY 2 matches
/a(bcd){2,3}e/ # matches abcdbcde or abcdbcdbcde because {2,3} means 2 or 3 matches, inclusive
/a(bcd){2,}e/ # matches any string with a, 'bcd' repeated AT LEAST 2 times, and an e.  {2,} represents a minimum

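By the way, you can't rely on specifying a maximum without a minimum (like {,2}), because older PCRE versions treat that as literal text rather than as a quantifier. If you want only a maximum, write {0,max}, with 0 as the explicit minimum and no spaces inside the braces. Also, an interesting note is that all of the other quantifiers can be represented in terms of { }: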

/a(bcd)*e/ is equal to /a(bcd){0,}e/
/a(bcd)+e/ is equal to /a(bcd){1,}e/
/a(bcd)?e/ is equal to /a(bcd){0,1}e/

The *, +, and ? quantifiers are often preferred for readability though.
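
The equivalences above can be checked with a short sketch. It anchors every pattern with ^ and $ so that only whole strings count, and it assumes preg_match() and preg_grep() as the way to run them; the sample strings are made up.

```php
<?php
$subjects = ['ae', 'abcde', 'abcdbcde', 'abcdbcdbcde'];

// Each shorthand quantifier paired with its { } spelling.
$pairs = [
    '/^a(bcd)*e$/' => '/^a(bcd){0,}e$/',
    '/^a(bcd)+e$/' => '/^a(bcd){1,}e$/',
    '/^a(bcd)?e$/' => '/^a(bcd){0,1}e$/',
];

$allAgree = true;
foreach ($pairs as $short => $braces) {
    foreach ($subjects as $subject) {
        // Both spellings should give the same answer for every subject.
        if (preg_match($short, $subject) !== preg_match($braces, $subject)) {
            $allAgree = false;
        }
    }
}
var_dump($allAgree); // bool(true)

// {2,3} accepts two or three repetitions of bcd, nothing else.
$twoOrThree = array_values(preg_grep('/^a(bcd){2,3}e$/', $subjects));
print_r($twoOrThree); // abcdbcde and abcdbcdbcde
```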

It's important to note that without using the anchoring metacharacters, ^ and $, a pattern will bring back matches in a string even if there are other characters present. For example, in the following string:

drtabcabcabcpdl

The following pattern would indeed bring back a positive match:

/ab*c/

If you wanted to ensure that a string contains ONLY a certain pattern, you'd need to anchor it:

/^ab*c$/

Now the pattern would only match a string consisting of an a, any number of b's, and a c, with nothing before or after.
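
A brief sketch of the difference, using the sample string from above and preg_match() (which returns 1 for a match and 0 for no match):

```php
<?php
$string = 'drtabcabcabcpdl';

$unanchored = preg_match('/ab*c/', $string);    // 1: 'abc' appears somewhere inside
$anchored   = preg_match('/^ab*c$/', $string);  // 0: the string is more than just a, b's, c
$exact      = preg_match('/^ab*c$/', 'abbbc');  // 1: nothing but a, b's, and c

var_dump($unanchored, $anchored, $exact);
```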

The Alternation Metacharacter (|)
The alternation metacharacter is basically equivalent to or in PHP (which is why it looks so much like ||). It tells the Engine to try each of the alternatives separated by |, from left to right. As soon as one alternative matches, the Engine stops trying the others and continues with the rest of the pattern. You'll most often use the grouping metacharacters to tell the | which strings to actually operate on, otherwise it could have quite unexpected results. The alternation metacharacter operates as far as the innermost enclosing parentheses.

/(yes)|(no)/

That would match either 'yes' or 'no' in its entirety, but it could have different results if the grouping was left out. It's highly recommended to keep track of how you group things, since it can completely change how the Engine looks at a pattern. Just to help you visualize an example of where grouping in alternation is important to keep track of, I'll show you the following pattern:

/prob|n|r|l|ate/

This pattern would actually match 'prob', 'n', 'r', 'l', or 'ate'. If you wanted to match probate, pronate, prorate, and prolate, you'd use:

/pro(b|n|r|l)ate/
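
To see how much the grouping matters, here is a small sketch that runs both patterns over the same made-up word list with preg_grep():

```php
<?php
$words = ['probate', 'prolate', 'n', 'later', 'pirate', 'zoo'];

// Without grouping, a lone 'n', 'r', or 'l' anywhere in the word is enough.
$ungrouped = array_values(preg_grep('/prob|n|r|l|ate/', $words));

// With grouping, the alternation only chooses the letter between 'pro' and 'ate'.
$grouped = array_values(preg_grep('/pro(b|n|r|l)ate/', $words));

print_r($ungrouped); // everything except 'zoo'
print_r($grouped);   // probate and prolate
```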

The Character Class Metacharacters ([ ])
Ah, finally we've reached character classes. These are extremely powerful regex concepts to learn and understand, so read this section over a few times if it doesn't click.
Character classes tell the Engine to match any character contained within [ ] as one character. I think it would be best to start off with a basic example, such as:

/c[aeiou]t/

This regex would match cat, cet, cit, cot, and cut, because the [aeiou] class contains those characters in between the c and the t. If I changed the character class to [au], it would only match cat and cut. Now, it's not to say that you couldn't accomplish the same thing with the alternation metacharacter, but it becomes very unwieldy and most of the "cool" functionality of character classes (which will be covered right after this) can't be achieved with it. The previous pattern could have been written as:

/c(a|e|i|o|u)t/

But who actually wants to type that?

The cool part about character classes is ranges. Inside of a character class, you can specify ranges (separated by a -) to match. If you wanted to match a string containing any 5 digit number, you could write this pattern:

/([0-9]{5})/

Acceptable ranges are a-z, A-Z, 0-9, and some other ranges involving the actual "value" of certain characters, but that's a bit advanced. Another important thing to mention is that ranges can be "stacked" inside of a single class. The following example illustrates that.

/^[a-zA-Z0-9_]+$/

That pattern would dictate that a string must consist of one or more characters, all of them alphanumeric characters or the underscore (_), due to the anchors (^ and $), the quantifier (+), and the character class ([a-zA-Z0-9_]).
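
A typical use for a pattern like this is validating something such as a username. The sketch below assumes made-up sample names and uses preg_grep() and preg_match() to run the two patterns from this section.

```php
<?php
$usernames = ['john_doe42', 'Jane', 'bad name', 'semi;colon', ''];

// Only names made up entirely of letters, digits, and underscores survive.
$valid = array_values(preg_grep('/^[a-zA-Z0-9_]+$/', $usernames));
print_r($valid); // john_doe42 and Jane

// The unanchored five-digit pattern finds the number inside a longer string.
$hasFiveDigits = preg_match('/([0-9]{5})/', 'ZIP: 90210'); // 1
```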

There are also "shortcut" character classes which the Engine understands automatically. They are as follows:

/\d/ #matches any digit
/\D/ #matches any NON-DIGIT
/\w/ #matches any word character (which includes the underscore and digits, so it's like [a-zA-Z0-9_])
/\W/ #matches any NON-WORD character
/\s/ #matches any whitespace character like a literal space, a tab, and a newline
/\S/ #matches any NON-WHITESPACE character

These shortcuts can be used both inside and outside of actual character classes, meaning they can appear anywhere in a pattern.
So, with this knowledge, we can shorten that "match 5 digits" pattern to:

/\d{5}/

Learn the shortcuts, as they'll help you a lot when you're actually writing patterns of your own. An example of using a shortcut inside of a character class would be:

/^[a-zA-Z\s]+$/

This pattern would match a string made up only of the letters a-z and A-Z and whitespace characters.
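
As a sketch, the shortcut and the spelled-out range can be compared directly, and the letters-and-whitespace pattern can be tried on a few made-up names (note the tab character in the second one, which \s accepts):

```php
<?php
$subject = 'ZIP: 90210';

// \d and [0-9] give the same answer for plain ASCII digits.
$sameResult = preg_match('/\d{5}/', $subject) === preg_match('/[0-9]{5}/', $subject);

$names = ['Mary Ann', "Tab\tSeparated", 'R2 D2'];
$lettersAndWhitespace = array_values(preg_grep('/^[a-zA-Z\s]+$/', $names));

var_dump($sameResult);            // bool(true)
print_r($lettersAndWhitespace);   // 'R2 D2' is rejected because of its digits
```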

There are also some other tiny nuances with character classes that you should really familiarize yourself with. If a character class starts with a ^, it no longer means the beginning of the string, but instead, it acts as ! (NOT) does in PHP. It negates the character class.

/c[^au]t/

That would match cbt, c$t, c!t, crt, etc, but not cat and cut.

Another interesting point to mention is that the . loses its metacharacter properties inside of a character class, meaning you can use it as a literal period inside of a class.
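
Both nuances can be seen in a short sketch. The word list and the number strings are made up for illustration.

```php
<?php
$words = ['cbt', 'c$t', 'crt', 'cat', 'cut'];

// [^au] accepts any single character except a and u.
$notAOrU = array_values(preg_grep('/c[^au]t/', $words));
print_r($notAOrU); // cbt, c$t and crt

// Inside a class the dot is literal, so only digits and real periods pass.
$literalDot = preg_match('/^[0-9.]+$/', '3x14');          // 0
// Outside a class the dot is the catch-all, so the x slips through.
$catchAllDot = preg_match('/^[0-9]+.[0-9]+$/', '3x14');   // 1
```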

Metacharacter Conclusion

That just about sums it up for the metacharacters (there's still some advanced syntax involving a few metacharacters, but it's nothing to worry about yet). Remember that all of these metacharacters can be used at once in a pattern, allowing you full control of how you match your string.

Quantifier Greediness

I felt that this topic deserved a page of its own. When using quantifiers, there is a concept known as greediness and laziness which can cause a lot of confusion for newcomers to regular expressions.

What is greediness?
Greediness is how the Engine interprets the "jurisdiction" of a quantifier. Let me set up a quick scenario:

You have the string 'exasperate'. You run the following pattern on it:

/e(.*)e/

Believe it or not, that (.*) actually matches xasperat instead of xasp as you may have thought. The regular behavior for a quantifier is to gobble up as many characters as it possibly can, hence greediness. It wants to grab as many characters as it can possibly get away with and still match. That's why it goes right past the second e in exasperate and keeps on matching until it reaches the last possible e it can.

Making the Match Lazy
This default behavior can be changed by adding a ? after a quantifier. In order to have (.*) match 'xasp' in the previous scenario, you could use:

/e(.*?)e/

That tells the engine to take only as much as it needs for the match to succeed, and nothing more. Another way that you might see greediness being countered is with negative character classes, but that only works when a single character marks where the match should stop. For example, suppose that in an HTML-matching pattern you wanted to get the text in a certain <p> tag, which just so happens to be followed by another <p>, like in this tiny snippet:

<p id="test">p1</p> <p id="test">p2</p>

You could write your pattern like this:

!<p id="test">(.+)</p>!

This would, not surprisingly, gobble up BOTH <p> tags, even though it's really mismatching the closing tag. The Engine doesn't realize that, and it likes being greedy, so it does. You could rewrite the pattern as:

!<p id="test">(.+?)</p>!

Or, you could use a negative character class and say:

!<p id="test">([^<]+)</p>!

The latter is often slightly quicker in terms of execution speed, but can be more difficult to understand.
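
Here is a sketch that runs the greedy, lazy, and negative-character-class patterns from this page side by side. It assumes preg_match() with a third argument, which receives the captured groups (index 1 is the first parenthesized group); that mechanism is explained properly in the next tutorial.

```php
<?php
// The 'exasperate' scenario: greedy versus lazy.
preg_match('/e(.*)e/', 'exasperate', $greedyWord);
preg_match('/e(.*?)e/', 'exasperate', $lazyWord);
echo $greedyWord[1], "\n"; // xasperat
echo $lazyWord[1], "\n";   // xasp

// The two-paragraph scenario.
$html = '<p id="test">p1</p> <p id="test">p2</p>';
preg_match('!<p id="test">(.+)</p>!', $html, $greedy);
preg_match('!<p id="test">(.+?)</p>!', $html, $lazy);
preg_match('!<p id="test">([^<]+)</p>!', $html, $negated);

echo $greedy[1], "\n";  // p1</p> <p id="test">p2
echo $lazy[1], "\n";    // p1
echo $negated[1], "\n"; // p1
```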

Another great use for negative character classes is when you're trying to get the information from a specific HTML tag's attribute. The following pattern would get an img tag's src attribute.

!<img(.+?)src="([^"]+)"(.*?) />!

Now, that's a really complicated pattern, but when you break it down, it's actually quite simple. First, you want to match the literal pattern '<img'. Then, you match any amount of characters (non-greedy), until you reach the src attribute. Then, you use a negative character class ([^"]+) in order to grab everything up to the ending ". Then you have any amount of characters, followed by the standard way to close an img tag.
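
As a sketch, here is that pattern applied to a made-up img tag. The tag, its attributes, and the file path are hypothetical sample data, and preg_match() with a third argument is assumed as the way to read the captured groups.

```php
<?php
// Hypothetical sample markup, written in the "standard" self-closing style.
$html = '<img class="logo" src="/images/logo.png" alt="Logo" />';

preg_match('!<img(.+?)src="([^"]+)"(.*?) />!', $html, $parts);

echo $parts[2], "\n"; // /images/logo.png  (the second group is the src value)
```

Keep in mind that the pattern expects the tag to close with a space followed by />, so a tag written as <img ...> without that ending would not match.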

The concept of greediness and laziness is very important to learn in order to get the results you want when we actually get to using the patterns matched (that's the subject of the next tutorial, actually). Re-read this page as many times as it takes until you completely understand the concept.

Pattern Modifiers

Remember when I was talking about reasons why a regular expression must have delimiters? It's not only to contain the regular expression, but it's also to allow for use of special pattern modifiers that go after the ending delimiter. These allow you to modify how the Engine actually views your pattern. In this section, I'm going to go over every modifier that's commonly used in match regexes. In the next tutorial, when substitution is covered, I'll go over the modifiers that work for substitution, as there are some differences.

The Insensitivity Modifier (i)
If you use the //i modifier, case insensitivity is enabled for the entire regular expression. Take, for example, the following regular expression:

/super/

That would match 'super', but not 'SuPeR' or 'SUPER'. In order to have it match all of those possibilities, you could use a lot of alternation or some clever character classes, or you could just apply the i modifier:

/super/i

Note how the i went AFTER the closing delimiter.
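
A quick sketch of the difference, using preg_grep() on the three spellings mentioned above:

```php
<?php
$inputs = ['super', 'SuPeR', 'SUPER'];

$caseSensitive   = array_values(preg_grep('/super/', $inputs));  // only 'super'
$caseInsensitive = array_values(preg_grep('/super/i', $inputs)); // all three

print_r($caseSensitive);
print_r($caseInsensitive);
```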

The Newline Match Modifier (s)
Back when I explained the Catch-All Metacharacter, ., I said that it would NOT match newlines. The //s modifier allows it to do so. Consider the following scenario.
You have a file with the following contents:

something
//start
STUFF!
some more stuff...
//end

If you wanted everything between those two comments (//start and //end), you would write your regex like this:

!//start(.+?)//end!s

Now, pay close attention to what I did. Since I actually needed to use the / character in my pattern, it made no sense to use it as the delimiter, because then it would need to be escaped every time I used it (/\/\/start(.+?)\/\/end/s), and it's incredibly hard to read, so I used an exclamation point as the delimiter. The s modifier on the end allows the pattern to match all of that stuff in between the two comments even though they're on separate lines.
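
The effect of the s modifier can be sketched by running the same pattern with and without it. The file contents from above are written here as one PHP string with \n line breaks; in real code they might come from file_get_contents().

```php
<?php
$contents = "something\n//start\nSTUFF!\nsome more stuff...\n//end\n";

// Without s, the dot stops at the first newline, so the match fails.
$withoutS = preg_match('!//start(.+?)//end!', $contents);

// With s, the dot crosses line breaks and the group captures everything between.
$withS = preg_match('!//start(.+?)//end!s', $contents, $between);

var_dump($withoutS);   // int(0)
var_dump($withS);      // int(1)
var_dump($between[1]); // the two lines of stuff, including the surrounding newlines
```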

The Multiline Mode Modifier (m)
This modifier is kind of strange, to be honest. It actually changes the behavior of those two anchoring metacharacters (^ and $). Normally, they'd match at the beginning and end of a string, respectively, but with the //m modifier, they also match at the beginning and end of every line, that is, right after and right before each \n (newline). You probably won't find yourself using this too often, unless you wanted to check, for example, that the lines in a file contain exactly 5 digits each:

/^\d{5}$/m
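
One way to use that pattern, sketched below, is to count the matching lines with preg_match_all() and compare the count with the number of lines. The sample data is made up, and preg_match_all() itself is covered in the next tutorial.

```php
<?php
$lines = "12345\n67890\n123\n";

// With m, ^ and $ work line by line, so each five-digit line is its own match.
$count = preg_match_all('/^\d{5}$/m', $lines, $found);

// Without m, ^ and $ refer to the string as a whole, which is not just five digits.
$withoutM = preg_match('/^\d{5}$/', $lines);

var_dump($count);    // int(2)
print_r($found[0]);  // 12345 and 67890
var_dump($withoutM); // int(0)
```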

The Freespace Modifier (x)
This is more of an advanced modifier, and it makes the Engine ignore whitespace inside a regex (unless the whitespace is escaped with a backslash or sits inside a character class). This allows you to actually create comments inside of a regex and freely space it to make it easier on the eyes. For example, the following regex (which illustrates some concepts that I haven't gone over yet, but don't worry about it, it's just to demonstrate free spacing):

/\b(\w\S+)(\s+\1)+\b/i

Can be written as:

/
	\b		#word boundary
	(\w\S+)		#word "chunk"
	(
		\s+	#whitespace
		\1	#same word "chunk"
	) +		#repeat if necessary
	\b		#boundary
/xi

Certainly more readable, right? Comments can be placed on the end of the line with a # and then your remarks.
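
As a sketch, both spellings can be stored in PHP strings and run against the same input to confirm that they behave identically. Single-quoted strings are used so that the backslashes reach the regex engine untouched, and the sample sentence is made up. One assumption to be aware of: the delimiter character must not appear inside the comments, or the pattern would end early.

```php
<?php
$compact = '/\b(\w\S+)(\s+\1)+\b/i';

$spaced = '/
    \b          # word boundary
    (\w\S+)     # word "chunk"
    (
        \s+     # whitespace
        \1      # same word "chunk"
    )+          # repeat if necessary
    \b          # boundary
/xi';

$subject = 'This is is a test';

preg_match($compact, $subject, $fromCompact);
preg_match($spaced, $subject, $fromSpaced);

echo $fromCompact[0], "\n"; // is is
echo $fromSpaced[0], "\n";  // is is
```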

Where to Find More Modifiers
You can always look at http://us3.php.net/manual/en/reference.pcre.pattern.modifiers.php in order to see every supported pattern modifier. Many of these are hardly used, so I deemed it unnecessary to show them all to you, but if you ever need to look one up, you can do it on that page. Remember, always consult the manual when you're in doubt about something. Also, keep in mind that there's actually a modifier that can ONLY be used when substituting, and I'll introduce it in the next tutorial, since you wouldn't be able to really visualize what it could do yet.

PCRE vs. POSIX

This is more of a technical part of the tutorial that I felt should be covered (thanks zanus!). There are actually two "flavors", if you will, of regular expressions supported by PHP. They are Perl-Compatible Regular Expressions (PCRE) and POSIX Extended regular expressions. PCRE is much more robust than POSIX, and it can do so many more things that POSIX simply can't even come close to. They're VERY similar in syntax (for the most part, until you get to advanced syntax, which will be in a later tutorial), but PCRE has a lot more functionality. I thought that I'd outline some of those differences here, so you know why you should most certainly learn PCRE over POSIX.

Binary data
PCRE can be used on binary data, whereas POSIX cannot. PCRE can function with all of the individual bytes and characters of any string, be it text or binary, which is a huge plus.

Speed
POSIX has the potential to be much slower than PCRE. I'm not going to go into much more detail than that because I'd rather not confuse people.

Modifiers and Delimiter
POSIX patterns do not support ANY modifiers, and they do not use delimiters like PCRE patterns do. The only option that you have with POSIX is case-insensitivity, and even that requires a completely different function (eregi() instead of ereg()), which I could hardly call optimal. PCRE's modifiers extend the regex language's capabilities in countless ways.

Deprecation and Removal
Here's one of the most important reasons to not use POSIX. POSIX regexes and the related functions (ereg(), eregi(), ereg_replace(), and eregi_replace()) are no longer a part of PHP. They were deprecated in PHP 5.3 and removed entirely in PHP 7.0 (PHP6 itself was never released), so code that relies on them will not run on a current PHP installation. POSIX is pretty much pointless to learn.

Usability in Perl
PCRE regular expressions were modeled after Perl's regular expression engine (hence Perl-Compatible), and can often be ported directly over to Perl, provided you know how to actually use regular expressions in Perl. Many of today's languages that support regular expressions also use some form of PCRE, so you only need to learn one flavor of regex in order to use it in many different languages.

Putting it All Together

Well, now that we have the basic syntax out of the way, I'm going to put some example patterns up for you to analyze, and then show you exactly what they do.

/^\w+:(\s+\w+)\s+\d+$/m

Matches any line made up of a word, a colon, whitespace, another word, more whitespace, and some digits (the m modifier makes ^ and $ work line by line)

/pro(b|n|r|l)ate/i

Matches probate, pronate, prorate, prolate, without case sensitivity

/^[+-]?\d+$/

Matches an integer which can have + or - (or even nothing) in front of it. It can also have leading zeros because 0 is included in \d.

~//start\n(.+?)\n//end~is

Matches //start, a newline, any amount of characters on any amount of lines (//s modifier), a newline, and //end
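
Two of the patterns above can be tried out with a short sketch. The sample values are made up, and preg_grep() and preg_match_all() are assumed as the way to run the patterns (actual usage in PHP is the subject of the next tutorial).

```php
<?php
// The signed-integer pattern.
$candidates = ['42', '-7', '+007', '4.2', '--1', ''];
$integers = array_values(preg_grep('/^[+-]?\d+$/', $candidates));
print_r($integers); // 42, -7 and +007

// The "word: word digits" pattern, counted line by line thanks to m.
$report = "apples: red 12\npears: green 7\nno digits here";
$lineCount = preg_match_all('/^\w+:(\s+\w+)\s+\d+$/m', $report);
var_dump($lineCount); // int(2)
```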

These are just some example patterns. When I get into advanced syntax, you'll be able to create much more intricate patterns, such as:

/(\d)(\d{3})(?!\d)/
replacement: $1, $2

I hope some of these basic patterns have gotten you more interested in pursuing regular expressions. ;)

Conclusion and Future Tutorials

This tutorial was meant to be a (comprehensive) tutorial of the most basic regular expression syntax. There are still many more advanced concepts, but that will be the subject of maybe the 3rd or 4th tutorial in this set. The next tutorial will involve actually applying these patterns in PHP, creating matches and using the grouping metacharacters to create match groups, substitution, and some other concepts directly related to regex use in PHP. Then, I'll cover all of the advanced concepts in order to help you create more efficient and more specific patterns.

Just to show you the utility of regular expressions, I actually had to use one to correct some of the tags that I used to show you the regular expressions. It looked a bit like this:

!\[code(=php)?\](.+?)\[/code\]!is

That found all of the code tags for me, so I could easily use a replacement pattern to make them into the proper tags. ;)