Why doesn't PHP DOMDocument loadHTML work for Persian characters?


Question

<?php

$data = <<<DATA
<div>
    <p>سلام</p>                                         // focus on this line
    <p class="myclass">Remove this one</p>
    <p>But keep this</p>
    <div style="color: red">and this</div>
    <div style="color: red">and <p>also</p> this</div>
    <div style="color: red">and this <div style="color: red">too</div></div>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

foreach ($xpath->query("//*[@*]") as $node) {
    $parent = $node->parentNode;
    while ($node->hasChildNodes()) {
        $parent->insertBefore($node->lastChild, $node->nextSibling);
    }
    $parent->removeChild($node);
}

echo $dom->saveHTML();

As I mentioned in the title of my question, the content of my website is Persian (not English), but the code above doesn't work for Persian characters.

Current output:

.
.
    <p>&#1587;&#1604;&#1575;&#1605;</p>
.
.

Expected output:

.
.
    <p>سلام</p>
.
.

What's wrong with it and how can I fix it?

Note: As you can see, I've already used mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8') to handle the encoding (based on this answer), but it still doesn't work.

Recommended answer

The Persian characters are being encoded as numeric character references. A browser will render them correctly, or you can recover the original text by decoding them with html_entity_decode(), e.g.:

echo html_entity_decode("&#1587;&#1604;&#1575;&#1605;");

Output:

سلام
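
When decoding a whole serialized document rather than a single string, html_entity_decode() would also turn references such as &amp;lt; and &amp;amp; back into markup characters. A minimal sketch of a narrower approach, assuming the mbstring extension is available: mb_decode_numericentity() with a map limited to code points from U+0080 upward decodes only the non-ASCII numeric references. The input string below is a hypothetical stand-in for the output of $dom->saveHTML(), not something taken from the question.

```php
<?php
// Hypothetical string standing in for the output of $dom->saveHTML().
$html = '<p>&#1587;&#1604;&#1575;&#1605; &amp; &lt;b&gt;</p>';

// Map format: [first code point, last code point, offset, mask].
// Only numeric references for U+0080 and above are decoded, so the
// escaped "&" and "<" stay escaped and the markup stays valid.
$decoded = mb_decode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

echo $decoded, PHP_EOL;
```

This only covers numeric references; any named references in the output would be left as they are.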






If you prefer the original characters in the output rather than numeric character references, you can change:

echo $dom->saveHTML();

to:

echo $dom->saveHTML($dom->documentElement);

This alters the serialization a bit and the result is:

<div>
    <p>سلام</p>
    Remove this one
    <p>But keep this</p>
    and this
    and <p>also</p> this
    and this too
</div>
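
Putting the pieces together, here is a hedged end-to-end sketch of the approach. Two things in it are assumptions rather than part of the original answer: the input is a shortened, hypothetical version of the question's markup, and mb_encode_numericentity() is swapped in for mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8'), because the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. The unwrapping loop and the saveHTML($dom->documentElement) call are the same as above.

```php
<?php
// Shortened, hypothetical input; not the question's full markup.
$data = '<div><p>سلام</p><p class="myclass">Remove this one</p><p>But keep this</p></div>';

// Turn every non-ASCII character into a numeric reference before parsing.
// Map format: [first code point, last code point, offset, mask].
$encoded = mb_encode_numericentity($data, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Unwrap every element that carries an attribute, as in the question:
// move its children up next to it, then drop the element itself.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@*]') as $node) {
    $parent = $node->parentNode;
    while ($node->hasChildNodes()) {
        $parent->insertBefore($node->lastChild, $node->nextSibling);
    }
    $parent->removeChild($node);
}

// Passing a node makes saveHTML() serialize just that subtree.
$html = $dom->saveHTML($dom->documentElement);
echo $html, PHP_EOL;
```

Because LIBXML_HTML_NOIMPLIED is set and the input has a single root element, $dom->documentElement is the outer div, so nothing from the document is lost by serializing only that node.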
