nodeValue from DomDocument returning weird characters in PHP
Problem description
So I'm trying to parse HTML pages and look for paragraphs (<p>) using get_elements_by_tag_name('p');
The problem is that when I use $element->nodeValue, it returns weird characters. The document is first loaded into $html using curl and then loaded into a DomDocument.
I'm sure it has to do with charsets.
Here's an example of a response: "aujourdâ€™hui" (where "aujourd’hui" was expected).
Thanks in advance.
Recommended answer
I had the same issues and now noticed that loadHTML() no longer takes 2 parameters, so I had to find a different solution. Using the following function in my DOM library, I was able to remove the funky characters from my HTML content.
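private static function load_html($html)
{
    $doc = new DOMDocument;
    // The XML declaration tells the parser to read the markup as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    // Remove the declaration again so it does not appear in the output.
    // Iterate over a copy, because childNodes is a live list.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType == XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    $doc->encoding = 'UTF-8';
    return $doc;
}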
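For anyone who wants to try this outside a class, here is a minimal standalone sketch of the same approach. The function name load_utf8_html and the sample markup are made up for illustration (the sample stands in for the curl response), and it assumes the input is actually UTF-8 and that this file is saved as UTF-8. The libxml_use_internal_errors() calls are an addition of mine to keep warnings about sloppy real-world markup quiet; they are not part of the fix itself.

```php
<?php
// Hypothetical sample standing in for the HTML fetched with curl (UTF-8 encoded).
$html = '<html><body><p>aujourd’hui</p></body></html>';

// Illustrative standalone variant of the load_html() method above.
function load_utf8_html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Prefix an XML declaration so the markup is read as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop the declaration node again; copy the live list before removing.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    $doc->encoding = 'UTF-8';

    return $doc;
}

$doc = load_utf8_html($html);

// Print the text of every paragraph, as in the question.
foreach ($doc->getElementsByTagName('p') as $p) {
    echo $p->nodeValue, PHP_EOL;
}
```

If the encoding hint is picked up, the paragraph text should come back with the apostrophe intact rather than as "â€™". If the page is not UTF-8 to begin with, the declared encoding would need to match the real one (or the content be converted first).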