
Splitting a string on (only) the last instance of a needle


stubarny


Hello,

 

I need to find the target html address within some raw html. I know it lies between Needle 1 and Needle 2 below.

 

Needle 1:

</span></a>   <a href="

 

Needle 2:

" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>

 

Normally I would just use the explode() function, but my headache is that there are many instances of Needle 1 in the raw html code and I want my output array to always have the same number of elements. However, I do know that my target html address is directly after the last instance of Needle 1.

 

How should I go about splitting the html code into an array that always has 3 elements? I'd like the first element to include code up to and including the last instance of Needle 2, the 2nd element to have the target html address, and the 3rd element to have Needle 2 and any following code.

 

Thanks for any pointers (my main struggle is identifying the last instance of needle 1),

 

Thanks,

 

Stu

Link to comment
Share on other sites

Or, learn how to program like the big boys and use regular expressions:

 

$pattern = '#</span></a>   <a href="([^"]+)" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>$#';

$output = preg_match($pattern, $subject, $matches);

$lastMatchedValue = $matches[1];

Link to comment
Share on other sites

Regular Expressions are not a substitute for basic string functions! Grumble grumble grumble.

 

The RegEx engine is slow, and isn't designed for simple things like this. Ignore mjdamato, he will lead you down the path to the dark side... ;)

Link to comment
Share on other sites

I am not saying regex is a substitute for string functions, but in this case it would take several string functions to get the value. First you have to find the position of the last instance of needle 2, then find the end of the instance of needle 1 that precedes it. Lastly, you would get the value with substr(), using the position of needle 1 plus the length of needle 1 as the start position, and the position of needle 2 minus that start position as the length.

 

Seems like an awful lot of work for something I can do in one line, even if it isn't the most efficient function. In fact, just to see what the performance hit would be, I built two solutions, one using regex and the other with string functions. The string solution was about 2x as fast as the regex solution. But we are talking very, very minuscule amounts of time. For 10,000 iterations the string solution averaged around 0.04 seconds while the regex solution took about 0.08 seconds. So, the performance benefit is definitely there for string functions. But the string solution took me much longer to put together and involved several steps. More steps mean more potential for bugs and regressions.
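
For anyone wanting to repeat a comparison like this, a rough timing harness might look like the sketch below. The helpers viaRegex(), viaString() and timeIt() and the sample $subject are made up for illustration; they are not the code that produced the figures above, and the numbers will vary by machine and PHP version.

```php
<?php
// Hypothetical stand-ins for the two solutions being compared.
$needle1 = '</span></a>   <a href="';
$needle2 = '" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';
$subject = '<div>' . $needle1 . 'page2.htm' . $needle2;

function viaRegex(string $subject, string $n1, string $n2)
{
    // preg_quote() keeps the needles literal inside the pattern.
    $pattern = '#' . preg_quote($n1, '#') . '([^"]+)' . preg_quote($n2, '#') . '#';
    return preg_match($pattern, $subject, $m) ? $m[1] : false;
}

function viaString(string $subject, string $n1, string $n2)
{
    // Last needle 2, then the last needle 1 in the text before it.
    $end = strrpos($subject, $n2);
    if ($end === false) { return false; }
    $start = strrpos(substr($subject, 0, $end), $n1);
    if ($start === false) { return false; }
    $start += strlen($n1);
    return substr($subject, $start, $end - $start);
}

// Runs $callback $iterations times and returns the elapsed seconds.
function timeIt(callable $callback, int $iterations): float
{
    $t0 = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $callback();
    }
    return microtime(true) - $t0;
}

$regexSeconds = timeIt(function () use ($subject, $needle1, $needle2) {
    viaRegex($subject, $needle1, $needle2);
}, 10000);

$stringSeconds = timeIt(function () use ($subject, $needle1, $needle2) {
    viaString($subject, $needle1, $needle2);
}, 10000);

printf("regex: %.4f s, string: %.4f s\n", $regexSeconds, $stringSeconds);
```

A single run is noisy, so it is probably worth repeating the measurement a few times before reading much into the ratio.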

Link to comment
Share on other sites

So we can conclusively say on a fairly idle machine there is double the performance out of string functions.

I don't see the number getting smaller on a server with full load ;)

 

I don't think this is much more complex than the code you've posted.

Shorthand

$offset = strrpos( $body, '</span></a>   <a href="' ) + 23; // 23 = length of the needle
$chunk = substr( $body, $offset, strpos($body,'"',$offset)-$offset );

 

Extended

$open = '</span></a>   <a href="';
$close = '"';
$offset = strrpos( $body, $open ) + strlen($open);
$chunk = substr( $body, $offset, strpos($body,$close,$offset)-$offset );

 

Considering that can easily be changed into a reusable function, and is TWICE as fast as RegEx, I don't really see an argument

Link to comment
Share on other sites

Considering that can easily be changed into a reusable function, and is TWICE as fast as RegEx, I don't really see an argument

 

Well, after looking back at the requirements I see that the "string" processes I built and the one you provided, as well as the regex function, would not work. The requirements are that the string being searched for is the last occurrence which starts with needle one AND ends with needle 2. The above string processes don't even look for needle 2. I didn't even test with sample data that includes needle 1 after the actual text being sought. Both my functions, and yours, would fail.

 

You could still create a process using only string functions, but it would be more involved and probably include loops. Working from the back, you would have to find an instance of needle 1 that precedes an instance of needle 2 AND ensure it doesn't include a double quote between those two positions. However, I was able to make one simple change to the regex to find the target value with certainty

function findTarget($subject)
{
    $pattern = '#</span></a>   <a href="([^"]+)" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>#';
    $result = preg_match_all($pattern, $subject, $matches);
    if(!$result) { return false; }
    return array_pop($matches[1]);
}

 

Yes, it could be done with just string functions, but would be more complicated than the above. When it is all said and done, I don't think that .000004 seconds on a single operation is worth getting worked up about.

 

If you are so inclined to try and build an operation of just string functions that meets the requirements, here is an input that would find the wrong text with the previous functions. The previous functions would find the text "THIS_IS_A_TRAP" instead of the correct text "CORRECT_TARGET".

$subject = '</span></a>   <a href="NOT_THIS_ONE" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div></span></a>   <a href="NOT_THIS_ONE_EITHER" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div></span></a>   <a href="CORRECT_TARGET" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';
$subject .= '<a href="somesite.htm"><span>Hyperlink Text</span></a>   <a href="THIS_IS_A_TRAP">other</a>';
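
One possible string-only take on that challenge is sketched below. findTargetBackwards() is a hypothetical helper, not something posted earlier in the thread. It assumes the target never contains a double quote, and it walks backwards through the instances of needle 2 until it finds one preceded by a usable needle 1.

```php
<?php
// Hypothetical helper: returns the text between the last needle 1 / needle 2
// pair that has no double quote in between, or false if there is none.
function findTargetBackwards(string $subject, string $n1, string $n2)
{
    $searchEnd = strlen($subject);
    while (true) {
        // Last needle 2 inside the part of the string not yet ruled out.
        $n2Pos = strrpos(substr($subject, 0, $searchEnd), $n2);
        if ($n2Pos === false) { return false; }

        // Last needle 1 that ends at or before that needle 2.
        $n1Pos = strrpos(substr($subject, 0, $n2Pos), $n1);
        if ($n1Pos === false) { return false; }

        $start = $n1Pos + strlen($n1);
        $candidate = substr($subject, $start, $n2Pos - $start);

        // A double quote would mean the href closed before needle 2 began.
        if ($candidate !== '' && strpos($candidate, '"') === false) {
            return $candidate;
        }

        // Otherwise retry with the previous instance of needle 2.
        $searchEnd = $n2Pos;
    }
}

$needle1 = '</span></a>   <a href="';
$needle2 = '" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';

$subject  = $needle1 . 'NOT_THIS_ONE' . $needle2;
$subject .= $needle1 . 'NOT_THIS_ONE_EITHER' . $needle2;
$subject .= $needle1 . 'CORRECT_TARGET' . $needle2;
$subject .= '<a href="somesite.htm"><span>Hyperlink Text</span></a>   <a href="THIS_IS_A_TRAP">other</a>';

echo findTargetBackwards($subject, $needle1, $needle2), "\n"; // prints CORRECT_TARGET
```

The loop only repeats when a candidate is rejected, so on well-formed input it should finish after a single pass.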

 

Link to comment
Share on other sites

Wow, after reading the OP more carefully you are correct.

 

This makes RegEx A LOT more complicated though, considering he wants...

The text from the end of the second last instance of needle2, including needle1

The target string

All the text following the target string to the end of the document.

... while matching needle1 and needle2 where needed.

 

Here's a summary of my RegEx woes. If you can find a way around this, let me know! I'd love to learn.

 

All of the RegEx below is using free-spacing, hence the '\ ' being used.

 

First, I match an instance of needle2, followed by as little HTML as possible and an instance of needle1. I save needle2 into a capturing group to be referenced later.

("\ rel="nofollow"><span\ class=pn><span\ class=np>Next »</span></span></a></div>)
(.*?)
(</span></a>  \ <a\ href=")

 

I then match as little as possible until needle2 is found again. I know I could use ([^"]++) to be faster, but I wanted this to be extremely portable. I also use a positive lookahead to get a zero-width match. If I don't use a zero-width match, the RegEx will only capture odd-numbered instances. This happens because the end of this match could also be the start of the next match.

(.*?)
(?=\1)
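
The odd-numbered-instances effect is easier to see on toy data. In the sketch below the letter "a" stands in for the delimiter; the subject and both patterns are made up for illustration and are not the thread's HTML.

```php
<?php
// Toy data: "a" plays the role of the delimiter.
$subject = 'a1a2a3a';

// The closing delimiter is consumed, so the next search starts after it
// and the value between the 2nd and 3rd "a" is skipped.
preg_match_all('/a(.*?)a/', $subject, $consumed);

// A lookahead checks for the closing delimiter without consuming it,
// so it can also serve as the opening delimiter of the next match.
preg_match_all('/a(.*?)(?=a)/', $subject, $lookahead);

print_r($consumed[1]);  // the values 1 and 3
print_r($lookahead[1]); // the values 1, 2 and 3
```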

 

Here's where things blow up. If I want to capture the rest of the document, it will also include any matches after the first. I could ignore that, return the matches with offsets, and simply grab everything after the last offset, but then I'm using string functions anyways. Even my reversal method requires the use of string functions and extra lines of code.
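
The offsets idea mentioned above could be sketched roughly as follows. splitOnLastMatch() is a hypothetical name, and the sketch assumes the first chunk should run from the start of the document rather than from the second-to-last needle 2. It uses preg_match_all() with PREG_OFFSET_CAPTURE and then cuts the string around the last match.

```php
<?php
// Hypothetical helper: three chunks around the last needle1...needle2 match,
// or false if the pattern never matches.
function splitOnLastMatch(string $subject, string $n1, string $n2)
{
    $pattern = '#' . preg_quote($n1, '#') . '([^"]+)' . preg_quote($n2, '#') . '#';
    if (!preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return false;
    }

    // With PREG_OFFSET_CAPTURE each entry is array(captured text, byte offset).
    list($target, $offset) = end($matches[1]);

    return array(
        substr($subject, 0, $offset),                // up to and including needle 1
        $target,                                     // the target address
        substr($subject, $offset + strlen($target)), // needle 2 and everything after it
    );
}

$needle1 = '</span></a>   <a href="';
$needle2 = '" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';

$subject  = $needle1 . 'NOT_THIS_ONE' . $needle2;
$subject .= $needle1 . 'CORRECT_TARGET' . $needle2;
$subject .= '<a href="somesite.htm"><span>Hyperlink Text</span></a>   <a href="THIS_IS_A_TRAP">other</a>';

$chunks = splitOnLastMatch($subject, $needle1, $needle2);
echo $chunks[1], "\n"; // prints CORRECT_TARGET
```

As noted above, this still leans on substr() for the cutting, so it is a hybrid rather than a pure regex answer.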

 

TL;DR - Here's the script I came up with. I have been unable to do this with RegEx alone - the only solution I managed involved reversing the string, which I have included in the example.

 

<?php 

$subject = '</span></a>   <a href="NOT_THIS_ONE" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>A WHOLE BUNCH OF CRAP IN BETWEEN</span></a>   <a href="NOT_THIS_ONE_EITHER" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>A WHOLE BUNCH OF CRAP IN BETWEEN</span></a>   <a href="CORRECT_TARGET" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';
$subject .= 'A WHOLE BUNCH OF CRAP IN BETWEEN<a href="somesite.htm"><span>Hyperlink Text</span></a>   <a href="THIS_IS_A_TRAP">other</a>';

$needle1  = '</span></a>   <a href="';
$needle2 = '" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';

print_r( findTargetRegEx( $needle1, $needle2, $subject ) );

echo "\n";

print_r( findTargetString($needle1,$needle2,$subject) );

// Arguments: needle1, needle2, string to search
function findTargetRegEx( $n1, $n2, $str ) {

preg_match(
	'%^(.*?)(' .strrev($n2). ')(.+?)' .strrev($n1). '(.*?)\2%s',
	strrev( $str ),
	$match
);

return array(
	$n2 . strrev($match[4]) . $n1,
	strrev($match[3]),
	$n2 . strrev($match[1])
); 

}

function findTargetString( $n1, $n2, $str ) {

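// Find second to last instance of needle2 - start of chunk1
// (search only the text before the last instance; a negative strrpos() offset is measured from the end of the string)
$chunk1_start = strrpos( substr($str, 0, strrpos($str,$n2)), $n2 );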
// Find the first instance of needle1 after start of chunk1
$chunk1_end = strpos( $str, $n1, $chunk1_start ) + strlen($n1);
// Find the first instance of needle2 after the end of chunk1
$chunk3_start = strpos( $str, $n2, $chunk1_end );

// Any error checking to make sure the values are correct could go here

// Return the 3 chunks
return array(
	substr( $str, $chunk1_start, $chunk1_end-$chunk1_start ),
	substr( $str, $chunk1_end, $chunk3_start-$chunk1_end ),
	substr( $str, $chunk3_start )
);

}

?>

 

In conclusion, I would definitely pick string functions based off my examples. The RegEx requires a lot of backtracking, especially the first lazy modifier. Every character before the first match (from a reversed perspective) will require two backtracks... so the more content you have after the matching text, the slower the RegEx becomes.
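
For comparison, one regex-only variant that avoids reversing the string is a greedy prefix: a leading (.*) runs to the end of the document and backtracks until it reaches the last needle 1 that still has a needle 2 somewhere after it. This is only a sketch; splitWithGreedyPrefix() is a hypothetical name, the first chunk starts at the beginning of the document (an assumption), and on large documents the backtracking could run into PCRE's backtrack limit.

```php
<?php
// Hypothetical helper: returns array(prefix incl. needle 1, target,
// needle 2 plus rest), or false if nothing matches.
function splitWithGreedyPrefix(string $subject, string $n1, string $n2)
{
    // "s" lets "." match newlines so the pattern can span the whole document.
    $pattern = '#^(.*' . preg_quote($n1, '#') . ')(.*?)('
             . preg_quote($n2, '#') . '.*)$#s';

    if (!preg_match($pattern, $subject, $m)) {
        return false;
    }
    return array($m[1], $m[2], $m[3]);
}

$needle1 = '</span></a>   <a href="';
$needle2 = '" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';

$subject  = $needle1 . 'NOT_THIS_ONE' . $needle2 . 'A WHOLE BUNCH OF CRAP IN BETWEEN';
$subject .= $needle1 . 'CORRECT_TARGET' . $needle2;
$subject .= '<a href="somesite.htm"><span>Hyperlink Text</span></a>   <a href="THIS_IS_A_TRAP">other</a>';

$parts = splitWithGreedyPrefix($subject, $needle1, $needle2);
echo $parts[1], "\n"; // prints CORRECT_TARGET
```

Whether this beats the string version on speed is untested here; the point is only that the three chunks can come out of a single preg_match() call.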

 

mjdamato, please dissect this response. I'd love to know a better way to perform the RegEx, or if I've missed the OP's point completely again. I think I got it right this time :D

 

Fewf. [/walloftext]

Link to comment
Share on other sites

What was wrong with the last solution I provided? It was not very complicated and it worked. The only downsides were that:

 

1. It used preg_match_all and then used the last match. So, it was slightly inefficient in having to find, potentially, multiple matches.

2. For the capture it uses ([^"]+), which works perfectly for this particular use case. But, it wouldn't be portable. If I wanted it to be portable I would just change the match to be (.*?) between needle 1 and needle 2. Not efficient, but portable.

 

This seems to meet the OP's requirements and is portable:

function findLastTarget($subject, $needle1, $needle2)
{
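    // preg_quote() keeps any regex metacharacters in the needles literal
    $pattern = '#' . preg_quote($needle1, '#') . '(.*?)' . preg_quote($needle2, '#') . '#';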
    //$pattern = '#</span></a>   <a href="([^"]+)" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>#';
    $result = preg_match_all($pattern, $subject, $matches);
    if(!$result) { return false; }
    return array_pop($matches[1]);
}

$needle1 = '</span></a>   <a href="';
$needle2 = '" rel="nofollow"><span class=pn><span class=np>Next »</span></span></a></div>';
$targetText = findLastTarget($subject, $needle1, $needle2);

Link to comment
Share on other sites

He asked for:

 

The content including and after the second last occurrence of needle2, and including needle1 (missing in your solution)

The specific URL after the last occurrence of needle1, and directly before the occurrence of needle2

The content of needle2, followed by the content of the rest of the document (missing in your solution)

 

Guess I should have included more in my TL;DR :P

Link to comment
Share on other sites
