SimpleXML, please do not expand entities


Question

I'm using SimpleXML to try to parse a large XML file with <!ENTITY declarations. Unfortunately, SimpleXML seems too eager to go ahead and expand those entities, and I'd rather it didn't: the entity symbols are short, easy to parse, and theoretically won't change in newer versions of the file, while the expanded entities are English sentences that may change. Is there any way to tell SimpleXML to knock it off?

I've thought of "pre-parsing" the XML file to strip out the <!ENTITY bits before passing the file contents to the XML parser, but that feels hacky, and since it's a huge file, I'd rather do as little fiddling with it as possible.

(Pardon any mistaken terminology in the above; I haven't done this level of XML work in quite a while.)

Recommended answer

It might seem so, but that's not the case (unless you specify a flag, which I guess you don't, although you don't show what your code does). It's just that SimpleXML can only hand the unexpanded entity back to you through the ->asXML() method, not through its to-string implementation.

Let's go through some examples to demonstrate how it works. I've picked this simple entity from the DTD:

<!ENTITY n "noun (common) (futsuumeishi)">

So let's select the first <pos> element, as it contains an &n; entity:

$xml = simplexml_load_file($file);
$pos = $xml->entry->sense->pos;

The variable $pos is now the SimpleXMLElement of the <pos> element node. Let's output it to see what the parser does with the &n; entity:

echo  "SimpleXML value (string): ", $pos         , "\n"
    , "SimpleXML value (XML)   : ", $pos->asXML(), "\n";

The output is:

SimpleXML value (string): noun (common) (futsuumeishi)
SimpleXML value (XML)   : <pos>&n;</pos>

As this example shows, the &n; is still there (<pos>&n;</pos>); it is only expanded at the moment you access the element as a string value (noun (common) (futsuumeishi)).

This, by the way, is totally OK: the XML specs say that it's up to the parser whether to expand those entities or not. Given what SimpleXML has been designed for, expanding them when reading the string value is exactly what you would expect.

You can even control this behavior by specifying the LIBXML_NOENT option:

$xml = simplexml_load_file($file, NULL, LIBXML_NOENT);

This actually does what you assumed was happening all along: the entities are expanded at parse time, and the XML output no longer contains the entity:

SimpleXML value (string): noun (common) (futsuumeishi)
SimpleXML value (XML)   : <pos>noun (common) (futsuumeishi)</pos>
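
To try the two modes without the original file, here is a minimal, self-contained sketch. The inline document, the JMdict root name and the variable names are assumptions made up for illustration; only the entity declaration and the element path are taken from the example above. It parses the same string once with default options and once with LIBXML_NOENT, and should mirror the two outputs shown above.

```php
<?php
// Hypothetical stand-in for the large file: one entity, one <pos> element.
$source = <<<'XML'
<!DOCTYPE JMdict [
  <!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict><entry><sense><pos>&n;</pos></sense></entry></JMdict>
XML;

// Default options: the entity reference is kept in the parsed tree.
$plain = simplexml_load_string($source)->entry->sense->pos;

// LIBXML_NOENT: entity references are substituted while parsing.
$expanded = simplexml_load_string($source, SimpleXMLElement::class, LIBXML_NOENT)
    ->entry->sense->pos;

// Keep the four values in variables so they can be inspected or compared.
$plainString    = (string) $plain;
$plainXml       = $plain->asXML();
$expandedString = (string) $expanded;
$expandedXml    = $expanded->asXML();

echo "default      (string): ", $plainString,    "\n"
   , "default      (XML)   : ", $plainXml,       "\n"
   , "LIBXML_NOENT (string): ", $expandedString, "\n"
   , "LIBXML_NOENT (XML)   : ", $expandedXml,    "\n";
```

Despite its name, LIBXML_NOENT does not mean "no entities"; it asks the parser to replace entity references with their content, which is why it is the opposite of what the question wants.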

So now the big question: how do you do what you're looking for? Well, the XML parser in PHP that actually has a model for entities is DOMDocument. It is a sister library of SimpleXML; internally, both share the same in-memory objects. Here is the output for that same object (more precisely: its only child node) in the two modes, without and with LIBXML_NOENT:

Mode 1:
DOMDocument Class       : DOMEntityReference
DOMDocument value(XML)  : &n;
DOMDocument ->nodeName  : n

Mode 2 (LIBXML_NOENT):
DOMDocument Class       : DOMText
DOMDocument value(XML)  : noun (common) (futsuumeishi)
DOMDocument ->nodeName  : #text

This output is created by the following code, which should make it clearer what is behind it:

$node   = dom_import_simplexml($pos);
$doc    = $node->ownerDocument;
$entity = $node->firstChild;

echo  "DOMDocument Class       : ", get_class($entity)    , "\n"
    , "DOMDocument value(XML)  : ", $doc->saveXML($entity), "\n"
    , "DOMDocument ->nodeName  : ", $entity->nodeName     , "\n";

As written, it is a sister library: dom_import_simplexml turns $pos into a DOMElement, and its first child is the node we are interested in, which we know is the entity reference in question.
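
The comparison of the two modes can be reproduced with a small self-contained sketch. The inline document and the helper describeFirstChild() are hypothetical names introduced only for this illustration; dom_import_simplexml(), get_class() and nodeName are used the same way as in the code above.

```php
<?php
// Hypothetical sample document mirroring the structure used in the answer.
$source = <<<'XML'
<!DOCTYPE JMdict [
  <!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict><entry><sense><pos>&n;</pos></sense></entry></JMdict>
XML;

/**
 * Hypothetical helper: parse $xml with the given libxml options and
 * return the DOM class and node name of the first child of <pos>.
 */
function describeFirstChild(string $xml, int $options): array
{
    $pos   = simplexml_load_string($xml, SimpleXMLElement::class, $options)
        ->entry->sense->pos;
    $child = dom_import_simplexml($pos)->firstChild;

    return [get_class($child), $child->nodeName];
}

$withoutNoent = describeFirstChild($source, 0);            // mode 1
$withNoent    = describeFirstChild($source, LIBXML_NOENT); // mode 2

echo "Mode 1: ", implode(" / ", $withoutNoent), "\n"
   , "Mode 2: ", implode(" / ", $withNoent), "\n";
```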

So now this starts to make perfect sense: as SimpleXML cannot represent an entity reference, it can only provide either the expanded string value or the XML containing the entity.

Otherwise, how would you distinguish the string values of

<pos>&n;</pos>
<pos><![CDATA[&n;]]></pos>

? So what you ask for makes only limited sense. That doesn't mean we can't deal with it, though: we can trick SimpleXML into doing it by extending the class. Let's say each element that contains nothing but a single entity reference should return that reference as its string value. In every other case, the standard SimpleXML stringification is used:

/**
 * Class EntityPreserveXML
 */
class EntityPreserveXML extends SimpleXMLElement
{
    /**
     * @return string
     */
    public function __toString()
    {
        $dom = dom_import_simplexml($this);
        if (
            !$dom instanceof DOMElement
            || $dom->childNodes->length !== 1
            || ! $dom->firstChild instanceof DOMEntityReference
        ) {
            return parent::__toString();
        }

        return $dom->ownerDocument->saveXML($dom->firstChild);
    }
}

Let's run that on our example from above:

require('EntityPreserveXML.php');
$xml = simplexml_load_file($file, 'EntityPreserveXML');
$pos = $xml->entry->sense->pos;

echo  "SimpleXML value (string): ", $pos         , "\n"
    , "SimpleXML value (XML)   : ", $pos->asXML(), "\n";

SimpleXML is now using the extended class, which gives the expected result:

SimpleXML value (string): &n;
SimpleXML value (XML)   : <pos>&n;</pos>

Because the &n; is the only child, it is now preserved in the to-string conversion of the SimpleXMLElement. But just because this works doesn't mean you should use it: it breaks an encoding boundary between parsed XML in the form of text and XML in the sense of the document model.

Probably you're just looking for DOMDocument? It's a model with much more detail, in which you can work directly with DOMEntityReference nodes if there are any.
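
As a rough sketch of that DOMDocument route, assuming the document is loaded with default options so that entity references are not substituted, the entity names can be collected by walking the tree. The function collectEntityReferences(), the inline document and the second entity v are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical sample document with two entity references.
$source = <<<'XML'
<!DOCTYPE JMdict [
  <!ENTITY n "noun (common) (futsuumeishi)">
  <!ENTITY v "verb (hypothetical sample entity)">
]>
<JMdict><entry><sense><pos>&n;</pos><pos>&v;</pos></sense></entry></JMdict>
XML;

/**
 * Hypothetical helper: return the names of all entity references found
 * below $node, in document order. It does not descend into the entity
 * references themselves, only into child elements.
 */
function collectEntityReferences(DOMNode $node): array
{
    $names = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMEntityReference) {
            $names[] = $child->nodeName;
        } elseif ($child instanceof DOMElement) {
            $names = array_merge($names, collectEntityReferences($child));
        }
    }

    return $names;
}

$doc = new DOMDocument();
$doc->loadXML($source); // default options: entity references are kept

$names = collectEntityReferences($doc->documentElement);

echo implode(", ", $names), "\n";
```

For a file as large as the one in the question, loading the whole tree into memory may itself be a concern; that trade-off is outside the scope of this sketch.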
