PCRE Regex Spotlight: \K

by Norm Olsen on Aug 21, 2009 11:36:15 AM

One backslash sequence that doesn’t get much attention is \K. This handy little sequence lets whatever comes before it in the pattern match as usual, then resets the start of the reported match to the current position in the subject string. Only what the rest of the pattern matches after \K ends up in the final match.

Ok, I can see the glazed look in some of your eyes. You may be wondering, “Why on earth would you want to scrap what you matched in the first place?”

We’ll answer this by building up a series of examples that illustrate the benefit of this bizarre behavior. In a preg_replace call, everything the pattern matches (whether it is captured or not) is available to the replacement string as $0. In a preg_match call, if a results array is supplied, the entire match is stored in the first array element, index [0].

Therefore, given:

$str = 'It\'s foobar time! ';
preg_match('#bar#', $str, $match);
echo $match[0]; // Output:  bar

$match[0] contains the complete match (in this case, bar). Now suppose we wanted to replace bar with pub, but only if foo comes right before it. We have a few options.

We could make use of a positive lookbehind assertion:

$str = 'It\'s foobar time! Free drinks at the bar!';
$str = preg_replace('#(?<=foo)bar#', 'pub', $str);
echo $str; // Output: It's foopub time! Free drinks at the bar!

Understand that in PCRE a lookbehind assertion must match a string of known, fixed length, so it cannot contain open-ended quantifiers such as *, + or {3,}. Also, lookaround assertions don’t actually consume any characters; they match positions in the subject string instead, which is why they are called zero-width assertions. Since the text a lookbehind inspects is not part of the match, it is not stored in $0 when dealing with replacements (or in array index [0] in the case of preg_match). Therefore we get a nice, clean and uncluttered result.

However, we can also make use of our new friend \K instead of a lookbehind construct, like so:
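Both points can be checked with a small sketch. The [$0] replacement string and the fo+ lookbehind pattern are my own illustrations, not part of the examples above, and the @ operator is used only to hide the compile warning so the return value can be inspected:

```php
<?php
$str = "It's foobar time!";

// $0 in the replacement holds only what the pattern consumed (bar),
// not the text the lookbehind inspected (foo).
$wrapped = preg_replace('#(?<=foo)bar#', '[$0]', $str);
echo $wrapped, "\n"; // It's foo[bar] time!

// An open-ended quantifier inside a lookbehind fails to compile,
// so preg_match() returns false instead of 0 or 1.
$result = @preg_match('#(?<=fo+)bar#', $str);
var_dump($result); // bool(false)
```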

$str = 'It\'s foobar time! Free drinks at the bar!';
$str = preg_replace('#foo\Kbar#', 'pub', $str);
echo $str; // Output: It's foopub time! Free drinks at the bar!

Same result as the lookbehind example. But what happened here? The pattern matched foo, but \K then discarded it from the reported match by resetting the match start to the current position in the subject string (in this case, between foo and bar). From there, bar is matched and thus replaced with pub. As you can now see, whatever the pattern matches prior to \K is left out of $0 in the replacement.
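One way to watch the reset happen is to ask preg_match for offsets. This is a minimal sketch of my own using the standard PREG_OFFSET_CAPTURE flag; the variable names are arbitrary:

```php
<?php
$str = "It's foobar time!";

// With PREG_OFFSET_CAPTURE each entry becomes [matched text, byte offset].
preg_match('#foo\Kbar#', $str, $match, PREG_OFFSET_CAPTURE);

$text   = $match[0][0];
$offset = $match[0][1];

// foo starts at offset 5, but the reported match starts where \K sits.
echo $text, ' at offset ', $offset, "\n"; // bar at offset 8
```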

Benefits of \K over lookbehind assertions

While the above examples show no difference between using \K and a lookbehind assertion, \K has one crucial benefit: quantifiers can precede it. For example, building on the previous sample, suppose we wanted to replace bar only if foo comes before it AND that foo contains three or more o’s.

$str = 'It\'s foobar time! Free drinks at the fooooobar bar!';
$str = preg_replace('#fo{3,}\Kbar#', 'pub', $str);
echo $str; // Output: It's foobar time! Free drinks at the fooooopub bar!

Ah, now the advantages become more apparent! Unlike the lookbehind construct, \K isn’t thrown off by quantifiers. As another example, say we’re given a string of numbers of varying lengths, delimited by dashes, with the requirement to replace every second dash with a space. \K works very smoothly here:

$str = '346-5654-78-90-3-116';
$str = preg_replace('#\d+-\d+\K-#', ' ', $str);
echo $str; // Output: 346-5654 78-90 3-116

This task would be more tedious with a lookbehind assertion, and even then only workable if each group of numbers always kept the same length from one subject string to the next. If the groups vary in length on a case by case basis, a lookbehind assertion is not possible at all, whereas \K handles the job with ease.
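For comparison, here is a sketch of the more conventional route, which captures the leading numbers and writes them back with $1. The capturing pattern is my own illustration rather than something from the examples above; it gives the same result, but the replacement string has to restore text that \K simply leaves alone:

```php
<?php
$str = '346-5654-78-90-3-116';

// \K version: only the second dash is part of the match.
$withK = preg_replace('#\d+-\d+\K-#', ' ', $str);

// Capture version: the numbers are matched too, so they must be put back.
$withGroup = preg_replace('#(\d+-\d+)-#', '$1 ', $str);

echo $withK, "\n";     // 346-5654 78-90 3-116
echo $withGroup, "\n"; // 346-5654 78-90 3-116
```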

The other nice thing about \K is that while it discards what is matched before it, it doesn’t discard what is captured before it.

To better illustrate, consider the following example:

$str = 'It\'s foobar time!';
preg_match('#(foo)\Kbar#', $str, $match);
echo '<pre>' . print_r($match, true) . '</pre>'; // let's see what this array looks like.

Output:

Array
(
    [0] => bar
    [1] => foo
)

With regard to $match[0] (which by default stores the whole match), you can see that \K did just what we expected it to do; it discarded what came before it. However, we can also clearly see that the capture itself was still retained and stored in $match[1]! Therefore, another benefit of using \K is that any groups captured prior to it are cleanly separated from index [0]. This can save code if that separation is in fact desirable.
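The same separation carries over to replacements. In this sketch (the [$1|$0] replacement string is just an illustration of mine), $1 still refers to the captured foo while $0, and therefore the text that actually gets replaced, is only bar:

```php
<?php
$str = "It's foobar time!";

// Only bar is replaced; the foo in the subject stays where it is,
// yet the captured copy is still available as $1.
$result = preg_replace('#(foo)\Kbar#', '[$1|$0]', $str);

echo $result, "\n"; // It's foo[foo|bar] time!
```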

The drawback of using \K

Like many other programming tools, \K has drawbacks. It cannot discard only a portion of what precedes it: everything matched prior to \K in the pattern is excluded from $0 / index [0], all or nothing. The excluded text is still consumed by the match, however, so a later match cannot start inside it. Contrast this with a lookbehind assertion, which concerns itself only with what is contained within its parentheses and consumes nothing.
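A small sketch shows where the two constructs part ways. The subject string aaaa and the X replacement are my own test case: the goal is to replace every a that follows another a. The lookbehind only peeks at the previous character, while the a in front of \K is used up by each match and cannot serve as the start of the next one:

```php
<?php
$str = 'aaaa';

// Lookbehind: the preceding a is inspected but not consumed,
// so the 2nd, 3rd and 4th a all qualify.
$lookbehind = preg_replace('#(?<=a)a#', 'X', $str);

// \K: each match uses up two characters (aa) and reports only the second,
// so the search resumes after it and only every other a is replaced.
$withK = preg_replace('#a\Ka#', 'X', $str);

echo $lookbehind, "\n"; // aXXX
echo $withK, "\n";      // aXaX
```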

So \K should be viewed as a tool with a very specific purpose, much like anything else. It is up to the programmer to know when it is beneficial to make use of it.

Conclusion and further reading

By now, it should be clear that \K is similar in effect to a lookbehind assertion, but it isn’t hampered by unknown character lengths. For fast, clean and simple replacements or matches, without the clutter that typically ends up in $0 or index [0], it just may be the quickest solution you can employ.

You can read more about character type sequences and assertions on the backslash reference page in the PHP manual.

Comments

Timothy McKeown Aug 21, 2009 1:12:54 PM

Great article Norm. The use and purpose of \K is very apparent and clear in your explanations and examples. I just hope I remember to use it when the task calls for it, as rare as it may seem to occur. Thanks!

Norm Olsen Aug 21, 2009 3:03:18 PM

Thanks Tim!

I agree, the proper circumstances must really be present to capitalize on \K's use. It's yet another tool in the tool box in the event it comes in handy :)

raccoon Sep 21, 2009 6:13:14 AM

Thanks a lot. I've run into this problem (with lookbehind asserts) all the time. This is definitely a great tip, thanks!
