
accented characters


1zeus1


I need to extract all <p> tags from a site in Italian.

 

////////////////////////////////////////////////////////////////////////

 

 

 

header('Content-Type: text/html; charset=iso-8859-1');

 

function curl_file_get_contents($url)

{

$curl = curl_init();

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

 

curl_setopt($curl,CURLOPT_URL,$url); //The URL to fetch. This can also be set when initializing a session with curl_init().

curl_setopt($curl,CURLOPT_RETURNTRANSFER,TRUE); //TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.

curl_setopt($curl,CURLOPT_CONNECTTIMEOUT,10); //The number of seconds to wait while trying to connect.

 

curl_setopt($curl, CURLOPT_USERAGENT, $userAgent); //The contents of the "User-Agent: " header to be used in a HTTP request.

curl_setopt($curl, CURLOPT_FAILONERROR, TRUE); //To fail silently if the HTTP code returned is greater than or equal to 400.

curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE); //To follow any "Location: " header that the server sends as part of the HTTP header.

curl_setopt($curl, CURLOPT_AUTOREFERER, TRUE); //To automatically set the Referer: field in requests where it follows a Location: redirect.

curl_setopt($curl, CURLOPT_TIMEOUT, 5); //The maximum number of seconds to allow cURL functions to execute.

 

$contents = curl_exec($curl);

curl_close($curl);

return $contents;

}

$get = curl_file_get_contents($url);

 

 

function getTextBetweenTags($tag, $html, $strict=0)

{

/*** a new dom object ***/

$dom = new domDocument;

/*** load the html into the object ***/

if($strict==1)

{

$dom->loadXML($html);

}

else

{

$dom->loadHTML($html);

}

 

/*** discard white space ***/

$dom->preserveWhiteSpace = false;

 

/*** the tag by its tag name ***/

$content = $dom->getElementsByTagname($tag);

 

/*** the array to return ***/

$out = array();

foreach ($content as $item)

{

/*** add node value to the out array ***/

$out[] = $item->nodeValue;

}

/*** return the results ***/

return $out;

}

 

 

<?php

$content = getTextBetweenTags('p', $get); // function name fixed; pass the HTML fetched into $get above

 

foreach( $content as $item )

{

echo $item.'.';

}

?>

 

/////////////////////////////////////////////////////////////////////////////////

My problem is that the output does not show accented characters correctly. For example:

////////////////////////////////////////////////////////////////////////////////

 

"con un certo miglioramento del testo e, cosa più importante, con le firme dei sottoscrittori. scusate la ripetizione. che però, dicevano gli antichi, iuvat. un caro saluto. "

 

I need your help.

Regards, Cristian


This is an encoding problem. Make sure that your PHP script and the XML file use the same encoding.

 

The encoding of the PHP output is declared here:

header('Content-Type: text/html; charset=iso-8859-1');

 

The XML file's encoding should be declared at the top of the XML file.
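
One way to keep the encodings consistent is sketched below. It rests on the fact that DOMDocument hands back node values as UTF-8, so a script that declares charset=iso-8859-1 in its header will show accented characters garbled unless the text is converted or the header is changed. The function name paragraphsAsUtf8 and the $sourceEncoding parameter are hypothetical names used only for this illustration, and the sketch assumes the mbstring extension is available and that you know (or can look up) the encoding of the fetched page.

```php
<?php
// Hypothetical helper, not part of the original post.
// Returns the text of every <p> element as UTF-8 strings.
function paragraphsAsUtf8(string $html, string $sourceEncoding = 'UTF-8'): array
{
    // Normalise the fetched bytes to UTF-8 first.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);

    // Replace every non-ASCII character with a numeric entity so the
    // HTML parser does not have to guess the input encoding.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect parser warnings from sloppy real-world markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $out[] = $p->nodeValue; // DOMDocument returns UTF-8 here
    }
    return $out;
}

// Possible usage with the code from the question (an assumption, adjust to your page):
//   header('Content-Type: text/html; charset=utf-8');
//   foreach (paragraphsAsUtf8($get, 'ISO-8859-1') as $item) { echo $item . '.'; }
```

If the page has to stay in iso-8859-1, the alternative is to keep the original header and convert each string before printing it, for example with mb_convert_encoding($item, 'ISO-8859-1', 'UTF-8').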

 

 

 
