PCRE Regex Spotlight: \K

Update: This article has been amended to correct some inaccuracies, chiefly that \K is an escape sequence and not an assertion. Thanks to member salathe for his input on this matter.

One escape sequence that doesn’t get much attention is \K. This handy sequence resets the starting point of the reported match: any characters matched before it are not included in the final matched sequence.

OK, I can see the glazed look in some of your eyes. You may be wondering, “Why on earth would you want to reset what you matched in the first place?”

We’ll answer this by building up a series of examples that ultimately illustrate the benefit of this bizarre behavior. Typically, everything the regex pattern matches (whether it is captured or not) is available to the replacement string of a preg_replace call as the reference $0. With preg_match, if a matches array is passed in, the entire match is stored in its first element, index [0].

Therefore, given:

$str = 'It\'s foobar time! ';
preg_match('#bar#', $str, $match);
echo $match[0]; // Output:  bar

$match[0] contains the portion of the subject string which matched the pattern (in this case – bar). Now suppose we wanted to replace bar with pub, but only if foo comes right before it. We have some options.

We could make use of a positive lookbehind assertion:

$str = 'It\'s foobar time! Free drinks at the bar!';
$str = preg_replace('#(?<=foo)bar#', 'pub', $str);
echo $str; // Output: It's foopub time! Free drinks at the bar!

Understand that a lookbehind assertion cannot match a variable number of characters. It may contain quantifiers, but not open-ended or ranged ones such as +, *, {2,} or {2,4}; a single-value interval such as {3} is acceptable. The engine must know the exact, fixed length of the string it is looking back for. Also, lookaround assertions don’t actually consume any characters; they match positions instead, which is why they are known as zero-width assertions. As a result, when dealing with replacements, the text they test is not part of $0 (or index [0] of the matches array in a preg_match call). Therefore we get a nice, clean and uncluttered result.
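To see the fixed-length restriction in action, here is a minimal sketch that puts an open-ended quantifier inside a lookbehind. The variable names are only for illustration, and the exact wording of the warning depends on the PCRE version bundled with your PHP build, so it is suppressed here rather than quoted.

```php
<?php
$str = 'It\'s foobar time! Free drinks at the bar!';

// fo+ can match any number of o's, so the lookbehind has no fixed length.
// PCRE refuses to compile the pattern; preg_replace() raises a warning
// (suppressed here with @) and returns null instead of a string.
$result = @preg_replace('#(?<=fo+)bar#', 'pub', $str);

var_dump($result); // Output: NULL
```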

However, we can also make use of our new escape sequence friend \K instead of using a lookbehind, like so:

$str = 'It\'s foobar time! Free drinks at the bar!';
$str = preg_replace('#foo\Kbar#', 'pub', $str);
echo $str; // Output: It's foopub time! Free drinks at the bar!

The result is the same as in the lookbehind example, just achieved differently. The pattern matches foo, but \K then resets the start of the reported match, dropping foo from it. Then bar is matched and replaced with pub. As you can see, any portion of the pattern prior to \K is left out of $0 and therefore out of the replacement.
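One way to watch the reset happen is to ask preg_match for offsets. The sketch below uses the PREG_OFFSET_CAPTURE flag, which is not part of the examples above, to show that the reported match starts where bar starts rather than where foo starts.

```php
<?php
$str = 'It\'s foobar time!';

// PREG_OFFSET_CAPTURE turns each entry into a [text, byte offset] pair.
preg_match('#foo\Kbar#', $str, $match, PREG_OFFSET_CAPTURE);

echo $match[0][0]; // Output: bar
echo "\n";
echo $match[0][1]; // Output: 8 (foo itself begins at offset 5)
```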

Benefits of \K over lookbehind assertions

While the output of the above examples shows no visible difference between using \K and a lookbehind assertion, \K has one crucial benefit: whatever precedes it is not required to be of fixed length. For example, building on the previous sample, suppose we wanted to replace bar only if foo comes before it AND that foo contains three or more o’s.

$str = 'It\'s foobar time! Free drinks at the fooooobar bar!';
$str = preg_replace('#fo{3,}\Kbar#', 'pub', $str);
echo $str; // Output: It's foobar time! Free drinks at the fooooopub bar!

Ah, now the advantages become more apparent! Unlike the lookbehind, \K isn’t thrown off by quantifiers specifying an unknown quantity. To demonstrate further, say we’re given a string of numbers of inconsistent length, delimited by dashes, with the requirement to replace every second dash with a space. \K works quite smoothly here:

$str = '346-5654-78-90-3-116';
$str = preg_replace('#\d+-\d+\K-#', ' ', $str);
echo $str; // Output: 346-5654 78-90 3-116

This task would be more tedious with a lookbehind assertion, and that is assuming the digit groups keep the same lengths from one input string to the next, so that a fixed-length lookbehind can be written at all. If the groups vary in length from case to case, a lookbehind is not even possible, whereas \K handles them with ease.
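For comparison, a common way to handle this without \K or a lookbehind is to capture the leading digit groups and write them back in the replacement. The sketch below puts the two approaches side by side; the variable names are illustrative only.

```php
<?php
$str = '346-5654-78-90-3-116';

// Without \K: capture the two digit groups and put them back via $1.
$withCapture = preg_replace('#(\d+-\d+)-#', '$1 ', $str);

// With \K: the digit groups are matched but left out of the replaced text.
$withK = preg_replace('#\d+-\d+\K-#', ' ', $str);

echo $withCapture; // Output: 346-5654 78-90 3-116
echo "\n";
echo $withK;       // Output: 346-5654 78-90 3-116
```

Both produce the same string; the \K version simply avoids the capture group and the back-reference in the replacement.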

The other nice thing about \K is that while it resets the start of the final matched sequence, it doesn’t affect what is captured before it.

To better illustrate, consider the following example:

$str = 'It\'s foobar time!';
preg_match('#(foo)\Kbar#', $str, $match);
echo '<pre>'.print_r($match, true); // let's see what this array looks like.

Output:

Array
(
    [0] => bar
    [1] => foo
)

With regard to $match[0], you can see that \K did just what we expected: it reset the start point of the match. However, we can also clearly see that the capture was still retained and stored in $match[1]! Therefore, another benefit of using \K is that any group captured before it is cleanly separated from index [0]. This can save code when that separation is what you want.
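The same separation carries over to replacements. In this small sketch (the replacement text is made up for illustration), only bar is replaced, yet the group captured before \K can still be referenced in the replacement string:

```php
<?php
$str = 'It\'s foobar time!';

// $0 is just bar, so only bar is replaced, but ${1} still holds foo.
$result = preg_replace('#(foo)\Kbar#', 'pub (was ${1}bar)', $str);

echo $result; // Output: It's foopub (was foobar) time!
```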

The drawback of using \K

Like many other aspects of programming, \K has its drawbacks. It cannot ‘discard’ only a portion of the pattern prior to it: everything matched before \K is dropped from $0 / index [0], with no way to keep some of it (granted, nothing stops you from using capture groups to retain parts that come before \K in the pattern).

\K should be viewed as a ‘tool’ performing a very specific purpose, much like anything else. It is up to the programmer to know when it is beneficial to make use of \K.

Conclusion and further reading

By now, it should be clear that the escape sequence \K is somewhat similar to a lookbehind assertion, but it isn’t hampered by unknown character lengths. For fast, clean and simple replacements or matches, without extra text cluttering $0 or index [0], it may be the quickest solution you can employ.

You can read more about \K and other escape sequences on the backslash reference page [http://ca3.php.net/regexp.reference.backslash] in the PHP manual [http://ca3.php.net/].