
Return Longest Paragraph


FishSword


Hiya! ;)

 

I have a file (see attached) that contains basic HTML for a page with multiple paragraphs.

How do I check the length of each paragraph to find out if it is 30 characters or over?

 

If the first paragraph is shorter than 30 characters, then it should move on to the next one, and so on.

 

If a paragraph is found to be the correct length, I then need to extract the text from the <p> tags.

If none of the paragraphs reach 30 characters in length, then the code will need to choose the longest paragraph.

 

Any help is greatly appreciated.

 

Many Thanks,

 

FishSword

 

[attachment deleted by admin]


There are various ways to get the text inside paragraphs into variables.  Two that are often used are the preg_match function and DOMDocument: load the HTML into a DOMDocument and access the paragraph nodes with getElementsByTagName (http://www.php.net/manual/en/domdocument.getelementsbytagname.php).

 

In either case, once you have the text in a string, you can use strlen to get the length of the strings.
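As a rough illustration of the DOMDocument route, here is a minimal sketch. The function name pickParagraphWithDom and the sample HTML are made up for this example and do not come from the thread; it assumes the DOM extension is available. It returns the first paragraph of at least the requested length, otherwise the longest one, otherwise false.

```php
<?php
// Sketch only: pickParagraphWithDom() and $sample are illustrative names, not from the thread.
function pickParagraphWithDom(string $html, int $minLength)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so imperfect real-world markup does not spam output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $longest = false; // stays false when the page has no <p> elements
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent is the paragraph text with any nested tags stripped.
        $text = trim($p->textContent);
        // strlen() counts bytes; mb_strlen() would count characters in multibyte text.
        if (strlen($text) >= $minLength) {
            return $text; // first paragraph that is long enough
        }
        if ($longest === false || strlen($text) > strlen($longest)) {
            $longest = $text;
        }
    }
    return $longest;
}

$sample = '<html><body><p>12345</p><p class="test">12345678901234567890</p><p>1234567</p></body></html>';
echo pickParagraphWithDom($sample, 30), "\n"; // nothing reaches 30, so the longest paragraph is printed
echo pickParagraphWithDom($sample, 5), "\n";  // the first paragraph already has 5 characters
```

Because the parser handles nested tags and attributes, this variant does not depend on how the markup is laid out, at the cost of parsing the whole document.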


Hi,

 

Thanks for your reply.

 

How would you achieve this, using preg_match?

I also found out that this can be done using strpos and strlen.

 

Which would you say is the best out of the above two, and how would each solution be achieved?

 

Thanks for your help.

 

FishSword


This will return the first paragraph that is at least 30 characters.

 

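// $text is the string to be searched.
// If this is a web page, then I would assume you are using something like:
//   $text = file_get_contents('http://somedomain.com/somefile.htm');

// (?:(?!</p>).) consumes one character at a time without running past the closing tag,
// and the "s" modifier lets a paragraph span several lines.
if (preg_match('#<p\b[^>]*>((?:(?!</p>).){30,})</p>#is', $text, $match)) {
    $firstParagraph30orMoreCharacters = $match[1];
}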

 

Edit: I just realized from your first post that if no paragraph has 30 or more characters, you need the longest of the ones that do exist. Give me a few minutes.


Here is a function that should do exactly what you want:

function getParagraph($input, $minLength)
{
    $minLength = (int) $minLength;
    //Check for 1st paragraph of minimum length
    //(?:(?!</p>).) keeps the match inside one paragraph; "s" allows multi-line paragraphs
    if(preg_match('#<p\b[^>]*>((?:(?!</p>).){' . $minLength . ',})</p>#is', $input, $match))
    {
        //Return 1st para matching min length, if found
        return $match[1];
    }
    //No para of min length found, Check for any paragraphs
    preg_match_all('#<p\b[^>]*>(.*?)</p>#is', $input, $matches);
    if(empty($matches[1]))
    {
        //No paragraphs found
        return false;
    }
    //Find longest paragraph and return it
    $longestPara = '';
    foreach($matches[1] as $para)
    {
        if(strlen($para) > strlen($longestPara))
        {
            $longestPara = $para;
        }
    }
    return $longestPara;
}

//Usage
echo getParagraph($text, 30);


You want something like this - no RegEx required. Won't work with nested tags.

 

<?php

$str = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Page Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>

<body>
<p>12345</p>
<p class="test">12345678901234567890</p>
<p>1234567</p>
<p>123456789012345</p>
</body>
</html>';

echo get_paragraph( $str,'p',30 );

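function get_paragraph( $html, $tag, $length ) {
// $b holds array(end offset, start offset) of the longest paragraph seen so far
$b = array(0,0);
$i = 0;
while( ($offset = strpos($html,'<'.$tag,$i)) !== FALSE ) {
	$start = strpos($html,'>',$offset);
	if( $start === FALSE )
		break; // unterminated opening tag
	$end = strpos($html,'</'.$tag.'>',$start);
	if( $end === FALSE )
		break; // no closing tag: stop instead of looping forever
	// The first paragraph that is long enough wins
	if( ($end-$start-1) >= $length )
		return substr($html,$start+1,$end-$start-1);
	// Otherwise remember the longest paragraph so far
	if( ($end-$start-1) > $b[0] - $b[1] )
		$b = array($end,$start+1);
	$i = $end;
}
return substr($html,$b[1],$b[0]-$b[1]);
}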
?>


"You want something like this - no RegEx required. Won't work with nested tags."

 

Seriously? You want to loop through the entire string instead of running a simple regex? The function I provided will return the correct result after only one line of code if there are any matches at or over the minimum length. The remaining lines are only there in case it needs to find the longest match below the minimum. And the majority of that code is comments.


What do you think your RegEx is doing? Looping through the string :D

 

Mine is simply an alternate way. I think benchmarks would show mine to be slightly more efficient as well, because there is no backtracking required.

 

Variety is the spice of life.
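For anyone who wants to measure rather than guess, a timing harness along these lines could be used. This is only a sketch: both helpers are simplified stand-ins written for this example (not the exact functions posted above), the generated input is hypothetical, and the numbers will vary with the PHP version, PCRE JIT settings and the shape of the HTML, so a single run does not settle the question.

```php
<?php
// Sketch only: simplified stand-ins for the two approaches, not the functions posted in the thread.
function firstLongByRegex(string $html, int $min)
{
    $pattern = '#<p\b[^>]*>((?:(?!</p>).){' . $min . ',})</p>#is';
    return preg_match($pattern, $html, $m) ? $m[1] : false;
}

function firstLongByStrpos(string $html, int $min)
{
    $i = 0;
    while (($offset = strpos($html, '<p', $i)) !== false) {
        $start = strpos($html, '>', $offset);
        if ($start === false) {
            break; // unterminated opening tag
        }
        $end = strpos($html, '</p>', $start);
        if ($end === false) {
            break; // no closing tag
        }
        if ($end - $start - 1 >= $min) {
            return substr($html, $start + 1, $end - $start - 1);
        }
        $i = $end;
    }
    return false;
}

// Hypothetical input: many short paragraphs followed by one long one.
$html = str_repeat('<p>short</p>', 2000) . '<p>' . str_repeat('x', 40) . '</p>';

foreach (['firstLongByRegex', 'firstLongByStrpos'] as $fn) {
    $t0 = microtime(true);
    for ($n = 0; $n < 200; $n++) {
        $result = $fn($html, 30);
    }
    printf("%-18s %.4f s (result length %d)\n", $fn, microtime(true) - $t0, strlen($result));
}
```

Running it several times on HTML that resembles the real pages gives a fairer picture than one run on synthetic input.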

