Character Encoding issue with PHP Simple HTML DOM Parser


Problem description


I am using PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/ to fetch data such as the page title, meta description and meta tags from other domains and then insert it into a database.

But I have some issues with encoding: I do not get the correct characters from websites that are not in English.

Below is the code:

<?php
require 'init.php';

$curl = new curl();
$html = new simple_html_dom();

$page = $_GET['page'];

$curl_output = $curl->getPage($page);

$html->load($curl_output['content']);
$meta_title = $html->find('title', 0)->innertext;

print $meta_title . "<hr />";

// print $html->plaintext . "<hr />";
?>

Output for the facebook.com page

Welcome to Facebook â€" Log in, sign up or learn more

Output for the amazon.cn page

亚马逊-网上购物商城:è¦ç½‘è´­, å°±æ¥Z.cn!

Output for the mail.ru page

Mail.Ru: почта, поиÑк в интернете, новоÑти, игры, развлечениÑ

So the characters are not being encoded properly.

Can anyone help me solve this issue so that I can insert correct data into my database?
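
A likely reading of the garbled output above, offered as an assumption because the fetched pages themselves are not shown: the pages are served as UTF-8, but their bytes end up displayed as if they were a single-byte encoding such as Windows-1252. The sketch below reproduces that effect for one character; `misread_as_cp1252()` is a hypothetical helper, not part of Simple HTML DOM Parser.

```php
<?php
// Hypothetical helper: treat UTF-8 bytes as if they were Windows-1252 text,
// then re-encode the misread characters as UTF-8 so they can be printed.
// Note: some byte values are undefined in Windows-1252, in which case
// iconv() may fail; an empty string is returned for that case.
function misread_as_cp1252(string $utf8): string
{
    $out = @iconv('Windows-1252', 'UTF-8', $utf8);
    return $out === false ? '' : $out;
}

// An en dash (U+2013) is the three bytes E2 80 93 in UTF-8. Read as
// Windows-1252, those bytes are U+00E2, U+20AC and U+201C: three
// unrelated characters instead of one dash.
print misread_as_cp1252("Facebook \u{2013} Log in") . "\n";
```

If this is what is happening, the data is intact and only the declared or assumed charset is wrong, which is why declaring UTF-8 consistently tends to fix it.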

Solution

@deceze and @Shakti thanks for your help.

+1 for the article link posted by deceze ("Handling Unicode Front to Back in a Web App"); "Understanding encoding" is also worth reading.

After reading your comments, the answer and of course those two articles, I finally solved my issue.

Here are the steps I have taken so far to solve this issue:

  1. Added header('Content-Type: text/html; charset=utf-8'); at the top of my init.php file,
  2. Changed the CHARACTER SET of the database table fields that store those values to UTF-8,
  3. Set the MySQL connection charset to UTF-8: mysql_set_charset('utf8', $connection_link_id);
  4. Used the htmlentities() function to convert characters: $meta_title = htmlentities(trim($meta_title_raw), ENT_QUOTES, 'UTF-8');
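
Steps 1, 3 and 4 can be sketched as below. This is a minimal sketch under assumptions, not the original init.php: the function names are hypothetical, and mysqli is swapped in for mysql_set_charset() because the old mysql extension was removed in PHP 7.0. The 'utf8mb4' charset is also a substitution for the 'utf8' used above, since MySQL's 'utf8' stores at most three bytes per character.

```php
<?php
// Step 1: declare the response as UTF-8 (skipped if output has already begun).
function send_utf8_header(): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
}

// Step 3: open a connection and set its charset. The connection parameters
// are placeholders supplied by the caller.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    $link->set_charset('utf8mb4');
    return $link;
}

// Step 4: trim the raw title and convert characters that have named HTML
// entities, treating the input as UTF-8.
function prepare_title(string $metaTitleRaw): string
{
    return htmlentities(trim($metaTitleRaw), ENT_QUOTES, 'UTF-8');
}
```

For example, prepare_title("  Café & <b>  ") trims the whitespace and returns the string with "é", "&", "<" and ">" replaced by their entities.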

Now the issue seems to be solved, BUT I still have to do the following to solve it in FULL:

  1. Get the character set of the source page into $source_charset.
  2. Convert the string to UTF-8 if it is not already in that encoding. PHP provides iconv() for this (mb_convert_encoding() is an alternative). Example: iconv($source_charset, "UTF-8", $meta_title_raw);
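
The conversion in step 2 can be sketched as follows. `to_utf8()` is a hypothetical helper name; it wraps the iconv() call from the example above, skips the conversion when the source is already UTF-8, and raises an exception when iconv() reports failure.

```php
<?php
// Hypothetical helper: convert $raw from $sourceCharset to UTF-8.
function to_utf8(string $raw, string $sourceCharset): string
{
    // Nothing to do when the source already claims to be UTF-8.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $raw;
    }

    // iconv() returns false on an unknown charset or an invalid byte sequence.
    $converted = @iconv($sourceCharset, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from {$sourceCharset} to UTF-8");
    }

    return $converted;
}
```

With this in place, $meta_title_raw would be passed through to_utf8() before the htmlentities() call, so that htmlentities() always receives UTF-8 input.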

To get $source_charset I will probably need a few tricks or multiple checks, such as checking the HTTP headers and the meta tags. I found a good answer at "Detect encoding".

Let me know if there are any improvements to, or faults in, my steps above.
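
The header and meta-tag checks could be sketched roughly as below. This is an illustration under assumptions, not the approach from the linked answer: `detect_charset()` is a hypothetical helper, the Content-Type header value is assumed to be available from the curl wrapper, and falling back to UTF-8 when nothing is declared is a guess that will be wrong for some pages.

```php
<?php
// Hypothetical helper: look for a declared charset, first in the HTTP
// Content-Type header, then in the HTML <meta> tags.
function detect_charset(?string $contentTypeHeader, string $html, string $default = 'UTF-8'): string
{
    // 1. HTTP header, e.g. "text/html; charset=windows-1251".
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._:\-]+)/i', $contentTypeHeader, $m)) {
        return $m[1];
    }

    // 2. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:\-]+)/i', $html, $m)) {
        return $m[1];
    }

    // 3. Nothing declared: fall back to the default (an assumption).
    return $default;
}
```

The result would then serve as $source_charset for the iconv() call, with the page body taken from $curl_output['content'].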
