PHP, SimpleXML, decoding entities in CDATA

Question

I'm seeing the following behavior:

$xml_string1 = "<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>";
$xml_string2 = "<person><name> Someone&#039;s Name </name></person>";

$person = new SimpleXMLElement($xml_string1);
print (string) $person->name; # Someone&#039;s Name

$person = new SimpleXMLElement($xml_string2);
print (string) $person->name; # Someone's Name

$person = new SimpleXMLElement($xml_string1, LIBXML_NOCDATA);
print (string) $person->name; # Someone&#039;s Name

The PHP docs say that LIBXML_NOCDATA will "Merge CDATA as text nodes". To me this means that CDATA will then be treated the same as text nodes, i.e. that the third example should behave the same as the second.

I don't have control over the XML (it's a feed from an external source); otherwise I'd just remove the CDATA tag, as it does nothing and ruins the behavior I want.

Why does the above example behave the way it does? Is there any way to make SimpleXML handle CDATA nodes the same way it handles text nodes? And what does "Merge CDATA as text nodes" actually do, since I don't seem to understand that option?

I'm currently decoding the entities after I pull out the data, but the above example still doesn't make sense to me.

Recommended answer

The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular >, < and &) to be escaped. A CDATA section containing the character & represents the same text as a normal text node containing &amp;.
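
A quick way to check this equivalence is the minimal sketch below. The variable names are illustrative and not from the original post.

```php
<?php
// A CDATA section holding a raw "&" ...
$fromCdata = new SimpleXMLElement('<name><![CDATA[P&O Cruises]]></name>');
// ... and an ordinary text node holding the escaped form "&amp;".
$fromText = new SimpleXMLElement('<name>P&amp;O Cruises</name>');

// Both represent the same text once cast to a string.
var_dump((string) $fromCdata);                        // string(11) "P&O Cruises"
var_dump((string) $fromCdata === (string) $fromText); // bool(true)
```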

If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would break as soon as someone mentioned "P&O Cruises": in a text node, that & simply can't be there on its own (rather than as &amp;, or as part of &somethingElse;).
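
To see why a bare & cannot simply be dropped into a text node, the following sketch feeds such a string to SimpleXML. It assumes only the standard libxml error functions; the exact error messages depend on the libxml version, so none are shown here.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

try {
    // A bare "&" in a text node is not well-formed XML.
    new SimpleXMLElement('<name>P&O Cruises</name>');
    $parsed = true;
} catch (Exception $e) {
    $parsed = false;
    echo 'Parse failed: ', $e->getMessage(), "\n";
    // Show what libxml complained about.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
}
libxml_clear_errors();
```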

The LIBXML_NOCDATA flag is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (Something which people frequently fail to notice, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.
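
The difference is easier to see through DOM, where each node is a separate object. The sketch below is illustrative only; `childNodeClasses` is a hypothetical helper written for this example, not part of any library.

```php
<?php
$xml = '<name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name>';

// Hypothetical helper: list the class of each child node of the root element.
function childNodeClasses(DOMDocument $doc): array
{
    $classes = [];
    foreach ($doc->documentElement->childNodes as $node) {
        $classes[] = get_class($node);
    }
    return $classes;
}

$plain = new DOMDocument();
$plain->loadXML($xml);

$merged = new DOMDocument();
$merged->loadXML($xml, LIBXML_NOCDATA);

// Without the flag, the CDATA section is its own DOMCdataSection node;
// with the flag, no DOMCdataSection remains.
print_r(childNodeClasses($plain));
print_r(childNodeClasses($merged));

// The text represented is the same, and SimpleXML's string cast returns it
// regardless of how the nodes are stored.
var_dump((string) new SimpleXMLElement($xml) === $merged->documentElement->textContent);
```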

What the flag effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:

$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";

$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
// CDATA retained: <?xml version="1.0"?>
// <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>

$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
// CDATA merged: <?xml version="1.0"?>
// <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>

If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independently of the XML. One common reason for a feed to be built this way (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:

<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
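
For the feed in the question, that separate decoding step might look like the sketch below. It assumes the CDATA content was escaped with HTML entities, so that PHP's html_entity_decode is the right inverse; that is an assumption about the feed, not something the XML itself guarantees.

```php
<?php
$xml = '<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>';
$person = new SimpleXMLElement($xml);

// The XML parser leaves CDATA content untouched, so this still contains "&#039;".
$raw = (string) $person->name;

// Decode the HTML entities as a separate step, outside of XML parsing.
// ENT_QUOTES is needed so that "&#039;" (a single quote) is decoded.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo trim($decoded), "\n"; // Someone's Name
```

Only apply this to fields known to carry entity-encoded text; decoding a field that holds plain text containing a literal "&amp;" would change its meaning.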
