DOM in PHP: Decoded entities and setting nodeValue


Problem description


I want to perform certain manipulations on an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, here is a quick example.

Suppose we have the following code

$doc = new DOMDocument();
$doc->loadXML($xmlData); // $xmlData: placeholder for a string holding the XML

$xpath = new DOMXPath($doc);
$node_list = $xpath->query($expression); // $expression: placeholder for some XPath

foreach($node_list as $node) {
    //do something
}

If the code in the loop is something like

$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);

it works fine. But if it's more like

$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;

and $text contains a decoded &, the ampersand does not get re-encoded on assignment, even if one does nothing with $text at all.

At the moment, I apply htmlspecialchars on $text before I set $node->nodeValue to it. Now I want to know

  1. if that is sufficient,
  2. if not, what would suffice,
  3. and if there are more elegant solutions for this, as in the case of attribute manipulation.

The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.


EDIT

It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
$doc->loadXML($output);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');

foreach($node_list as $node) {
    $node->nodeValue = $node->textContent;
}
echo $doc->saveXML();

If I execute this code on the CLI with

php beeb.php |egrep 'link|Warning'

I get results like

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>

which should be

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>

(and is, if the loop is omitted), along with corresponding warnings such as

Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15

When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
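
For reference, the workaround described above might look like the following sketch. It uses an inline stand-in for a single feed item (an assumption, so that it runs without network access), and the ENT_XML1 | ENT_QUOTES flags are one reasonable choice rather than something the question specifies.

```php
<?php
// Hypothetical stand-in for one <link> element of the feed.
$xml = '<item><link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link></item>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$link = $doc->getElementsByTagName('link')->item(0);

// textContent returns the decoded string, i.e. with a bare "&".
$text = $link->textContent;

// Re-encode the XML special characters before assigning the string
// to the element's nodeValue.
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
$link->nodeValue = $escaped;

echo $doc->saveXML();
```

Note that this escapes by hand what the DOM library would otherwise handle itself, which is the source of the discomfort mentioned above.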

Solution

As hakre explained, the problem is that in PHP's DOM library, the way setting nodeValue treats entities depends on the class of the node; in particular, DOMText and DOMElement differ in this regard. To illustrate this, an example:

$doc = new DOMDocument();
$doc->formatOutput = true;
$doc->loadXML('<root/>');

$s = 'text &amp;&lt;<"\'&text;&text';

$root = $doc->documentElement;

$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);

$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);

$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);

echo $doc->saveXML();

outputs

Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
  <tag1>text &amp;&lt;&lt;"'&text;</tag1>
  <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
  <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
</root>
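
The tag2 case suggests one way to set an element's text without escaping it by hand: replace the element's children with a node made by createTextNode(). The sketch below illustrates this. The helper name setElementText is hypothetical and not part of PHP's DOM API, and the sample document and string are made up for the illustration.

```php
<?php
/**
 * Hypothetical helper (not part of PHP's DOM API): replaces an element's
 * content with a single text node, so the string is taken literally.
 */
function setElementText(DOMElement $element, string $text): void
{
    // Remove any existing children first.
    while ($element->firstChild !== null) {
        $element->removeChild($element->firstChild);
    }
    // createTextNode() does not interpret entity references; special
    // characters are escaped when the document is serialized.
    $element->appendChild($element->ownerDocument->createTextNode($text));
}

$doc = new DOMDocument();
$doc->loadXML('<root><tag>old</tag></root>');

$tag = $doc->documentElement->firstChild;
$s = 'a & b < c';
setElementText($tag, $s);

$out = $doc->saveXML($tag);
echo $out, "\n";
```

Reading $tag->textContent afterwards gives back the original string, while the serialized form carries the escaped characters.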


In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers, one gets a quite elegant solution:

$doc = new DOMDocument();
$doc->loadXML($xmlData); // $xmlData: placeholder for a string holding the XML

$xpath     = new DOMXPath($doc);
$node_list = $xpath->query($expression); // $expression: placeholder for some XPath

$visitTextNode = function (DOMText $node) {
    $text = $node->textContent;
    /*
        do something with $text
    */
    $node->nodeValue = $text;
};

foreach ($node_list as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        $visitTextNode($node);
    } else {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType == XML_TEXT_NODE) {
                $visitTextNode($child);
            }
        }
    }
}
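
As a quick check of this approach, here is a minimal self-contained sketch. It assumes an inline stand-in for the feed (so that it runs without network access) and, as a variation on the loop above, selects the text nodes directly with text() in the XPath expression.

```php
<?php
// Hypothetical inline stand-in for the feed from the question.
$xml = <<<XML
<rss><channel><item>
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
</item></channel></rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Select the text nodes themselves instead of their parent elements.
foreach ($xpath->query('//item/link/text()') as $textNode) {
    // Round trip: read the decoded text and write it back unchanged.
    $textNode->nodeValue = $textNode->textContent;
}

$out = $doc->saveXML();
echo $out;
```

Because the assignment targets DOMText nodes, the ampersand should survive the round trip and be serialized as an escaped entity, with no manual call to htmlspecialchars.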
