
Char encoding?


phpsycho


I came across some Chinese or Japanese characters in my db, but they don't display correctly. What function do I need to use so the characters come out correctly? I'm not trying to translate them, just display the text correctly.

For example, my db has:

¥Ü¡¼¥«¥í¥¤¥É¤Î²Î»ìÃÖ¾ì

 

but what the text actually looks like is this:

ボーカロイドの歌詞置場

Link to comment
Share on other sites

Japanese. I don't think Chinese people are quite as obsessed with Vocaloid.

 

That particular string is in EUC-JP encoding.

The most common international encoding (for handling as many characters from as many alphabets as possible) is probably UTF-8. I suggest you try to use that for everything.
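
To make the EUC-JP point concrete, here is a minimal sketch of how that garbled string can arise and be undone. It assumes the mbstring extension, a source file saved as UTF-8, and that the bytes were misread as ISO-8859-1 somewhere along the way (a guess; the thread doesn't say which single-byte charset was involved).

```php
<?php
// Requires the mbstring extension; this file must be saved as UTF-8.

// The correctly displayed text from the question, as UTF-8.
$utf8 = 'ボーカロイドの歌詞置場';

// The raw bytes an EUC-JP page would contain for that text.
$eucJpBytes = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// Misreading those bytes as ISO-8859-1 (assumed) produces the mojibake.
$garbled = mb_convert_encoding($eucJpBytes, 'UTF-8', 'ISO-8859-1');

// Undo the misreading to get the raw bytes back, then decode them as EUC-JP.
$rawAgain = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
$repaired = mb_convert_encoding($rawAgain, 'UTF-8', 'EUC-JP');

echo $garbled, "\n", $repaired, "\n";
```

The repair step only works if the garbling was a clean misreading like this one. Text that already had bytes replaced by "?" can't be recovered this way.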

Link to comment
Share on other sites

It's actually a scraper, so it's scraping data from websites: their title, description, keywords, etc.

I picked up a couple of Japanese sites, and they have that meta tag, just with a different charset.

 

So this is the code I have to convert the title of the site:

 

$charset = 'None';
$description = '';
$keywords = '';
preg_match("/<head.*>(.*)<\/head>/smUi", $html, $headers);
if (count($headers) > 0) {
    if (preg_match("/<meta[^>]*http-equiv[^>]*charset=(.*)(\"|')>/Ui", $headers[1], $results)) {
        $charset = $results[1];
    } else {
        $charset = 'None';
    }
} else {
    $ok = 0;
    //echo 'No HEAD - Might be malformed or be a feed<br />';
}

if ($charset != 'None') {
    $title = mb_convert_encoding($title, "UTF-8", $charset);
}

if ($title == null) {
    $title = $url;
}

 

Shouldn't that fix the problem, using that mb_convert_encoding?

Link to comment
Share on other sites

It might. I've never messed with converting encodings; I always stick with UTF-8. I do know that the manual page for that function has tons of user comments about odd behavior, though.

 

Give it a shot, and let us know.

Link to comment
Share on other sites

That's what this piece of code does...

 

if (preg_match("/<meta[^>]*http-equiv[^>]*charset=(.*)(\"|')>/Ui", $headers[1], $results)) {
    $charset = $results[1];
} else {
    $charset = 'None';
}

if ($charset != 'None') {
    $title = mb_convert_encoding($title, "UTF-8", $charset);
}

Link to comment
Share on other sites

I believe [0] would be the whole tag, not just the charset.
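
A small sketch of the [0] versus [1] distinction, using a made-up sample tag rather than the real page. The pattern here is a loosened variant, not the one from the post: the posted one requires the closing quote to sit directly before ">", so an XHTML-style tag ending in " />" would not match it.

```php
<?php
// Hypothetical sample tag; real pages vary in quoting and spacing.
$head = '<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />';

// Stop the capture at the first quote, whitespace, semicolon, slash or ">",
// so the match does not depend on what follows the closing quote.
$pattern = '/<meta[^>]*charset\s*=\s*["\']?([^"\'\s;\/>]+)/i';

$charset = null;
if (preg_match($pattern, $head, $results)) {
    $matched = $results[0]; // everything the pattern matched
    $charset = $results[1]; // only the first parenthesised group
}

echo $charset, "\n";
```

The same pattern also picks up the short HTML5 form, e.g. a tag written as meta charset='utf-8'.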

 

I even wrote a small bit of code to test it, and it still doesn't work.

$url='http://010701070107.blog5.fc2.com/';
$userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
$buffer = curl_exec($ch);
$curl_info = curl_getinfo($ch);
curl_close($ch);
$header_size = $curl_info['header_size'];
$header = substr($buffer, 0, $header_size);

preg_match("~charset=([^\s]*)\s~is", $header, $header);

$header = mb_convert_encoding("Japanese text", "UTF-8", $header[1]);
echo $header;

 

That just spits out this: ???若?????ゃ?????�臀??
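
One plausible cause, offered as a guess since the literal in the test isn't shown: if the Japanese text was typed straight into a PHP file saved as UTF-8, it is already UTF-8, and telling mb_convert_encoding that it is EUC-JP garbles it. Note also that with CURLOPT_NOBODY set, the request returns headers only, so there is no page text to convert. A minimal guard, using a hypothetical helper name (toUtf8), might look like this:

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not
// already valid UTF-8. $declared is the charset the page claims to use.
function toUtf8(string $bytes, string $declared): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already UTF-8; converting again would garble it
    }
    return mb_convert_encoding($bytes, 'UTF-8', $declared);
}

// Usage: the same text, once as UTF-8 and once as EUC-JP bytes.
$utf8 = 'ボーカロイド';
$euc  = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

echo toUtf8($utf8, 'EUC-JP'), "\n"; // left untouched
echo toUtf8($euc, 'EUC-JP'), "\n";  // converted
```

This check is a heuristic. Plain ASCII always passes as UTF-8, which is harmless, and in rare cases bytes from another encoding can happen to be valid UTF-8.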

Link to comment
Share on other sites

Here's how I echoed your site in proper UTF-8:

 

<?php 

$url = 'http://010701070107.blog5.fc2.com/';
$userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
// Left disabled so the page body is actually fetched
// (CURLOPT_NOBODY makes cURL return headers only).
//curl_setopt($ch, CURLOPT_HEADER, true);
//curl_setopt($ch, CURLOPT_NOBODY, true);
$buffer = curl_exec($ch);
$curl_info = curl_getinfo($ch);
curl_close($ch);

// Rewrite the declared charset so the browser reads the output as UTF-8.
$expr = '%text/html; charset=euc-jp%';
$buffer = preg_replace($expr, 'text/html; charset=UTF-8', $buffer);
// Convert the page bytes themselves from EUC-JP to UTF-8.
$buffer = mb_convert_encoding($buffer, 'UTF-8', 'EUC-JP');
echo $buffer;
?>

Link to comment
Share on other sites

Sigh, I'm still having issues with this.

 

$html = $response;
$url = $info['url'];
$erurl = $info['url'];
$code = $info['http_code'];

$domain = parseHOST($url);

$url = str_ireplace("www.","",$url);
$url = rtrim($url, "/");


if($html && $code >= 200 && $code < 300){

$charset = 'None';
$description='';
$keywords='';
preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers);
if(preg_match("/<meta[^>]*http-equiv[^>]*charset=([^\"']*)>/Ui",$headers[1], $results)){

$expr = '%charset=([^"\']*)%';
$html = preg_replace( $results[1], 'UTF-8', $results[0]);
$html =  mb_convert_encoding($html,'UTF-8',$results[1]);
}

if(preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers)){
preg_match("/<head.*>(.*)<\/head>/smUi",$html, $headers);
}

 

Sorry I'm kinda new to the whole charset thing.

It's still coming out as ¥Ü¡¼¥«¥í¥¤¥É¤Î²Î»ìÃÖ¾

Link to comment
Share on other sites

I'm scraping a bunch of sites, but this one in particular is http://010701070107.blog5.fc2.com/

I am scraping it for the title, description, and keywords.

So I want to pull the charset and convert the cURL response first, then pull the important data out of the converted response.

 

And yes, yours worked. Now I'm just trying to integrate it into my script, but I'm having a hard time doing so.
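
The order described above (find the charset, convert the whole response once, then extract) could be sketched roughly like this. The function name extractMeta, the UTF-8 default, and the description pattern (which assumes name comes before content in the tag) are all assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper; $html stands in for the raw cURL response body.
function extractMeta(string $html): array
{
    // 1. Find the declared charset while the bytes are still unconverted.
    //    The tag itself is ASCII, so this works on EUC-JP bytes too.
    $charset = 'UTF-8'; // assumed default when nothing is declared
    if (preg_match('/<meta[^>]*charset\s*=\s*["\']?([^"\'\s;\/>]+)/i', $html, $m)) {
        $charset = $m[1];
    }

    // 2. Convert the whole document once, before pulling anything out.
    if (strcasecmp($charset, 'UTF-8') !== 0) {
        try {
            $converted = @mb_convert_encoding($html, 'UTF-8', $charset);
        } catch (\ValueError $e) { // PHP 8 throws on unknown charset names
            $converted = false;
        }
        if (is_string($converted)) {
            $html = $converted;
        }
    }

    // 3. Extract from the converted text; the "u" flag treats it as UTF-8.
    $title = preg_match('/<title[^>]*>(.*?)<\/title>/isu', $html, $t) ? trim($t[1]) : '';
    $desc  = preg_match(
        '/<meta[^>]*name=["\']description["\'][^>]*content=["\']([^"\']*)["\']/iu',
        $html,
        $d
    ) ? $d[1] : '';

    return ['charset' => $charset, 'title' => $title, 'description' => $desc];
}

// Usage with a made-up page, encoded as EUC-JP to mimic the scraped site.
$page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />'
      . '<title>ボーカロイドの歌詞置場</title><meta name="description" content="歌詞" /></head></html>';
$result = extractMeta(mb_convert_encoding($page, 'EUC-JP', 'UTF-8'));
echo $result['title'], "\n";
```

Converting once up front means the title, description, and keywords all come from the same UTF-8 text, instead of each field being converted separately.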

Link to comment
Share on other sites

Okay, it's not the converting of the data that's the problem.

It works, but in order for it to work I have to echo the HTML from the website after echoing the title, description, and keywords.

If I don't do that, the converted title, description, and keywords don't display correctly.

 

Any clue why that's happening?

 

EDIT: Uhh, okay, it only works some of the time, and when it does, the HTML has to be echoed for it to work. lol wtf
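
A likely explanation, though only a guess since the output page isn't shown: when the script prints just the title, nothing tells the browser which encoding to use, so it guesses. When the scraped HTML is echoed as well, its meta tag (rewritten to UTF-8 as in the earlier reply) tells the browser. If that's the case, declaring the charset in the response header should make the echo unnecessary. A minimal sketch, with a hypothetical function name:

```php
<?php
// Hypothetical wrapper around the output step of the scraper.
function printUtf8(string $text): string
{
    // Declare the encoding before any output, so the browser need not guess.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    // Escape for HTML, treating the input as UTF-8.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    echo $safe, "\n";
    return $safe;
}

printUtf8('ボーカロイドの歌詞置場');
```

The header has to be sent before anything else is printed. If the values also pass through the db, its connection charset needs to be UTF-8 as well, but that is outside what this sketch covers.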

Link to comment
Share on other sites
