Need help parsing HTML documents


christianbale

Hi all,

I need help parsing HTML documents. My code is below:

 

<?php

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
            // Add a line break before block-level tags
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern: a space for each of the 9 invisible
            // blocks, then a newline plus the matched text for the 7 block patterns
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text
    );

    // strip_tags removes the remaining html tags
    return strip_tags( $text );
}

 

function strip_random_characters( $text )
{
    // This function removes the rest of the special characters.
    // The /u flag assumes $text is valid UTF-8.

    // Replace punctuation other than . = $ ' € % - with a space, then do the
    // same for anything that is not a word character, whitespace, or one of
    // those symbols.
    $special_characters = preg_replace(
        array( "/(?![.=$'€%-])\p{P}/u", "/[^-\w\s.=$'€%]/u" ),
        ' ',
        $text
    );

    // Blank out digits and a fixed list of leftover symbols.
    $data = str_replace(
        array(
            ".", ",", "/", "^", "(", ")", "'", "-",
            "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
            "×", "¢", "‘", "¨", "™", "ª", "à", "¤", "®", "¥", "€", "Ð", "Ñ", "Š",
            "»", "°", "Ä", "Œ", "Ã", "±", "§", "•", "¿", "¡", "‡",
        ),
        ' ',
        $special_characters
    );

    return $data;
}

 

 

function upper_to_lower( $string )
{
    // This function converts the ASCII letters A-Z to lower case.
    return str_replace( range( 'A', 'Z' ), range( 'a', 'z' ), $string );
}

 

 

This code removes the HTML tags completely, but I'm still getting special characters in the processed document. Here are some of the characters that remain: "  â ² â ³n    â ² â ³wï    ï       n           wï lpg=pa    amp dq=%  only+fjord+on+the+east+coast%  v=onepage amp q=%  only%  fjord%  on% ". I want to remove these special characters.

 

Any help would be appreciated. Thanks!

 

 

 


As a side note, you could look at HTML Purifier for removing HTML tags and the like. I believe it's very thorough.

 

Also, what's wrong with PHP's strtolower() function for converting characters to lower case?
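
To illustrate the built-in route, here is a minimal sketch. The helper name to_lower() is made up for this example, and the mb_strtolower() branch assumes the text is UTF-8 and that the mbstring extension may or may not be installed.

```php
<?php
// Hypothetical helper (not from the original post): lowercase a string
// using PHP's built-in functions instead of a hand-written letter table.
function to_lower(string $text): string
{
    // Prefer the multibyte-aware function when mbstring is loaded, so that
    // non-ASCII letters are lowercased too. Assumes $text is UTF-8.
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($text, 'UTF-8');
    }

    // Fallback: strtolower() covers the ASCII letters A-Z.
    return strtolower($text);
}

echo to_lower("Only FJORD on the East Coast"), "\n"; // prints: only fjord on the east coast
```

Either branch leaves digits, punctuation and whitespace untouched, so it can be dropped in wherever upper_to_lower() was called.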

 

Finally, do you have a sample document? I don't see why a single preg_replace() with a whitelist of characters wouldn't work, and I'm curious to try it.
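
Without the sample document this is only a guess, but a whitelist pass might look like the sketch below. The helper name keep_letters_only() and the choice of whitelist (Unicode letters and whitespace) are assumptions for illustration, and the input is assumed to be UTF-8. The stray "â" sequences in the posted output look like multi-byte characters that were processed byte by byte, and the leftover "amp" looks like an undecoded &amp; entity, so the sketch decodes entities first and uses the /u modifier so the pattern works on whole characters.

```php
<?php
// Hypothetical helper (not from the original post). The whitelist of
// "letters and whitespace" is an assumption; adjust it to taste.
function keep_letters_only(string $html): string
{
    // Drop the tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Whitelist: replace every run of characters that is neither a Unicode
    // letter nor whitespace with a single space. The /u modifier makes the
    // pattern operate on UTF-8 characters rather than single bytes.
    $text = preg_replace('/[^\p{L}\s]+/u', ' ', $text);
    if ($text === null) {
        // With /u, preg_replace() returns null if the input is not valid UTF-8.
        return '';
    }

    // Collapse the whitespace left behind by the replacements.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo keep_letters_only('<p>Only fjord &amp; coast: 100% “east” ′n″</p>'), "\n";
// prints: Only fjord coast east n
```

If the document turns out not to be UTF-8, it would need converting first (for example with mb_convert_encoding()) before the /u pattern can be applied.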
