Character Encoding Issue with PHP Simple HTML DOM Parser
I am using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/)
to fetch data such as the page title, meta description and meta tags from other domains and then insert it into a database.
But I have some encoding issues: I do not get the correct characters from websites that are not in English.
Below is the code:
<?php
require 'init.php';
$curl = new curl();
$html = new simple_html_dom();
$page = $_GET['page'];
$curl_output = $curl->getPage($page);
$html->load($curl_output['content']);
$meta_title = $html->find('title', 0)->innertext;
print $meta_title . "<hr />";
// print $html->plaintext . "<hr />";
?>
Output for facebook.com
page
Welcome to Facebook â€" Log in, sign up or learn more
Output for amazon.cn
page
亚马逊-网上è´ç‰©å•†åŸŽï¼šè¦ç½‘è´, å°±æ¥Z.cn!
Output for mail.ru
page
Mail.Ru: почта, поиÑк в интернете, новоÑти, игры, развлечениÑ
So the characters are not being displayed properly.
Can anyone help me solve this issue so that I can add correct data to my database?
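The garbled output above is consistent with UTF-8 bytes being read as a single-byte encoding such as ISO-8859-1. The following is a minimal sketch of that effect, assuming the mbstring extension is available; the sample word and variable names are illustrative and not taken from the original code.

```php
<?php
// UTF-8 bytes for the Russian word "почта" ("mail"), written as escapes so
// the example does not depend on the encoding of this source file.
$utf8 = "\xD0\xBF\xD0\xBE\xD1\x87\xD1\x82\xD0\xB0";

// Simulate the misreading: treat every byte as one ISO-8859-1 character
// and re-encode the result as UTF-8. Each original character becomes two.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The damage is reversible as long as no bytes were dropped on the way.
$restored = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');

var_dump(mb_strlen($utf8, 'UTF-8'));     // int(5)
var_dump(mb_strlen($mojibake, 'UTF-8')); // int(10)
var_dump($restored === $utf8);           // bool(true)
```

If this is what is happening, the fetched bytes themselves are fine and only the declared output encoding is wrong.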
@deceze and @Shakti, thanks for your help.
+1 for the article link posted by deceze (Handling Unicode Front to Back in a Web App); Understanding encoding is also worth reading.
After reading your comments, the answer and of course those two articles, I finally solved my issue.
I have listed the steps I have taken so far to solve this issue:
- Added
header('Content-Type: text/html; charset=utf-8');
at the top of my init.php file.
- Changed the CHARACTER SET of the database table field that stores those values to UTF-8.
- Set the MySQL connection charset to UTF-8:
mysql_set_charset('utf8', $connection_link_id);
- Used the htmlentities() function to convert characters:
$meta_title = htmlentities(trim($meta_title_raw), ENT_QUOTES, 'UTF-8');
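The last step can be sketched as a small helper. This is only an illustration: the function name normalize_title() is hypothetical, and the sample string is an assumption rather than data from the pages above.

```php
<?php
// Hypothetical helper (not part of the original code): trims a scraped
// title and escapes it for HTML output, treating the input as UTF-8.
function normalize_title(string $rawTitle): string
{
    return htmlentities(trim($rawTitle), ENT_QUOTES, 'UTF-8');
}

// "Café" with the é given as UTF-8 bytes, plus quotes and stray whitespace.
$raw = "  Caf\xC3\xA9 \"quoted\"  ";
echo normalize_title($raw), "\n"; // Caf&eacute; &quot;quoted&quot;

// Note: the mysql_* extension used above was removed in PHP 7.0.
// With mysqli the equivalent call would look like this (assuming an open
// connection in $link; utf8mb4 is a suggestion, not the original setting):
//   mysqli_set_charset($link, 'utf8mb4');
```

Without ENT_SUBSTITUTE or ENT_IGNORE, htmlentities() returns an empty string when the input is not valid in the stated encoding, which is one reason the source text needs to be converted to UTF-8 first.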
Now the issue seems to be solved, but I still have to do the following to solve it in full:
- Get the charset of the source page ($source_charset).
- Convert the string to UTF-8 if it is not already in that encoding. PHP's iconv() function can do this (mb_convert_encoding() is an alternative). Example:
iconv($source_charset, "UTF-8", $meta_title_raw);
To get $source_charset I will probably have to use some tricks or multiple checks, such as checking the HTTP headers and the meta tag. I found a good answer at Detect encoding.
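A rough sketch of such a multi-step check follows. The function names detect_source_charset() and to_utf8() are hypothetical, the regular expressions are simplified, and the fallback candidate list is an assumption; real pages may declare a charset that does not match their actual bytes. It assumes the iconv and mbstring extensions are loaded.

```php
<?php
// Hypothetical helper: guess the charset of a fetched page by checking
// the HTTP Content-Type header first, then the HTML meta tags.
function detect_source_charset(string $contentTypeHeader, string $html): string
{
    // 1. HTTP header, e.g. "text/html; charset=windows-1251".
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // 2. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="...; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    // 3. Last resort: let mbstring guess from a short candidate list.
    $guess = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1'], true);
    return $guess !== false ? $guess : 'UTF-8';
}

// Hypothetical helper: convert to UTF-8 only when the source differs.
function to_utf8(string $text, string $sourceCharset): string
{
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $text;
    }
    // iconv() returns false for an unknown charset; keep the input then.
    $converted = @iconv($sourceCharset, 'UTF-8//IGNORE', $text);
    return $converted !== false ? $converted : $text;
}

// Usage: "почта" encoded as Windows-1251, as a Russian page might send it.
$charset = detect_source_charset('text/html; charset=windows-1251', '');
$title   = to_utf8("\xEF\xEE\xF7\xF2\xE0", $charset);
echo $charset, ' ', bin2hex($title), "\n";
```

Checking the header before the meta tag follows the usual precedence for HTML, but either source can be wrong, so treating the result as a best guess is safer than trusting it blindly.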
Let me know if there are any improvements to make or any faults in my steps above.