PHP DOMDocument - why is en dash "–" converted to â€"
Question
I am using DOMDocument to extract some paragraphs.
Here is what the initial .htm file I am importing looks like:
<html>
<head>
<title>Toxins</title>
</head>
<body>
<p class=8reference><span>1.</span><span>Sivonen, K.; Jones, G. Cyanobacterial Toxins. In <i>Toxic Cyanobacteria in Water. A Guide to Their Public Health Consequences, Monitoring and Management</i>; Chorus, I., Bartram, J., Eds.; E. and F.N. Spon: London, UK, 1999; pp. 41–111.</span></p>
</body>
</html>
What I am doing:
$dom_input = new \DOMDocument("1.0", "UTF-8");
$dom_input->encoding = "UTF-8";
$dom_input->formatOutput = true;
$dom_input->loadHTMLFile($manuscript->getUploadRootDir().$manuscript->getFileName());
$paragraphs = $dom_input->getElementsByTagName('p');
foreach ($paragraphs as $paragraph) {
    // Only dump paragraphs marked as references
    if ($paragraph->getAttribute('class') == "8reference") {
        var_dump($paragraph->nodeValue);
    }
}
The dash in "pp. 41–111" is converted to
pp. 41â€"111
Any idea why this happens, and how I can fix it so that I get UTF-8 values?
Thanks.
Recommended answer
It looks to me like the data is correct and you are just displaying it incorrectly.
Are you outputting in UTF-8?
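One way to tell a display problem from a data problem is to look at the raw bytes rather than the rendered characters. The sketch below is only an illustration: `classifyDashBytes` is a hypothetical helper, not part of DOMDocument, and its second check is a rough heuristic.

```php
<?php
// Hypothetical helper (not from the original post): classify a string by its raw bytes.
function classifyDashBytes(string $text): string
{
    // U+2013 EN DASH, correctly encoded as UTF-8, is the byte sequence E2 80 93.
    if (strpos($text, "\xE2\x80\x93") !== false) {
        return 'utf-8';
    }
    // C3 A2 is the UTF-8 encoding of "â". Finding it where a dash was expected
    // hints (heuristically) that UTF-8 bytes were re-encoded a second time.
    if (strpos($text, "\xC3\xA2") !== false) {
        return 'double-encoded';
    }
    return 'unknown';
}

$sample = "pp. 41\u{2013}111";
echo bin2hex($sample), PHP_EOL;          // hex dump of the raw bytes
echo classifyDashBytes($sample), PHP_EOL;
```

In the question's loop, the same call could be made on `$paragraph->nodeValue`. If it reports `utf-8`, the data is intact and only the output side needs attention; if it reports `double-encoded`, that would suggest the text was misread on the way in rather than on the way out.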
An "Ã" or "â" followed by stray characters is a classic sign of UTF-8 encoded data being displayed as if it were in some other encoding.
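The effect can be reproduced on purpose. This minimal sketch assumes the mbstring extension is available and picks Windows-1252 as the "wrong" encoding purely for illustration; a browser or terminal may guess a different one.

```php
<?php
// Assumes the mbstring extension is loaded.
$enDash = "\u{2013}";   // EN DASH, stored as the three UTF-8 bytes E2 80 93

// Deliberately misread those three bytes as Windows-1252 characters,
// then re-encode the result as UTF-8 so it can be printed.
$garbled = mb_convert_encoding($enDash, 'UTF-8', 'Windows-1252');

echo bin2hex($enDash), PHP_EOL;   // e28093
echo $garbled, PHP_EOL;           // three separate characters instead of one dash
```

Each of the three bytes is turned into its own character, which is why a single dash shows up as a short run of unrelated symbols starting with "â".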
For example, if you are outputting to a web browser, try setting the character set with a meta tag:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
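When the page is generated by PHP, the same declaration can also be sent as an HTTP header. The following is a sketch under that assumption; `renderUtf8Page` is a hypothetical helper name introduced here, not something from the original post.

```php
<?php
// Hypothetical helper: wrap extracted text in a minimal page that declares UTF-8.
function renderUtf8Page(string $text): string
{
    // Escape markup characters; the en dash itself is left untouched.
    $safe = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\">\n"
        . "</head>\n<body>\n<p>{$safe}</p>\n</body>\n</html>\n";
}

// Declare the charset in the HTTP header as well, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderUtf8Page("pp. 41\u{2013}111");
```

Sending the header before any output matters, because PHP cannot change headers once the body has started.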
If you need to output in something other than UTF-8, you will need to convert the text into that character set first.
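As a rough sketch of such a conversion, assuming the mbstring extension and taking Windows-1252 as an example target (the actual target charset depends on where the output is going):

```php
<?php
// Assumes the mbstring extension is loaded; Windows-1252 is only an example target.
$utf8Text = "pp. 41\u{2013}111";

// Convert from UTF-8 to the target charset before writing the output.
$cp1252Text = mb_convert_encoding($utf8Text, 'Windows-1252', 'UTF-8');

// In Windows-1252 the en dash is the single byte 0x96.
echo bin2hex($cp1252Text), PHP_EOL;
```

The converted string should then be sent with a matching charset declaration, otherwise the same kind of garbling appears in the opposite direction.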