lxml - Is there any hacky way to keep &quot;?

Problem description

I noticed that the XML entity &quot; is automatically converted to its literal character when a document is parsed and written back out:

>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
  <elem>"hello world"</elem>
</root>

>>> 

I found a related old thread (2009-02-07):

s = cStringIO.StringIO("""&quot;She's the MAN!&quot;""")
e = etree.parse(s, etree.XMLParser(resolve_entities=False))

Note that there's also etree.fromstring().

etree.tostring(e)
'"She\'s the MAN!"'

I would have expected resolve_entities=False to have prevented the translation of, e.g., &quot; to ".

The "resolve_entities" option is meant for entities defined in a DTD of which you want to keep the reference instead of the resolved value. The entities you mention are part of the XML spec, not of a DTD.

Is there another way to prevent this behavior (or, if nothing else, to reverse it after the fact)?

Well, what you get is well-formed XML. May I ask why you need the entity references in the output?

Still, the reply only asks why you would want to do that; there is no direct answer to the problem. I'm quite surprised that the etree parser forces the conversion without giving an option to disable it.
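One way to see why a parser switch would not help: once parsing is done, the tree holds only decoded characters, so the two spellings can no longer be told apart. The following is a minimal Python 3 sketch that uses the standard library's xml.etree.ElementTree instead of lxml, purely to stay dependency-free; the variable names are illustrative and not from the original post.

```python
import xml.etree.ElementTree as ET

# The same text written two ways: once with entity references, once literally.
with_entities = ET.fromstring("<elem>&quot;hello world&quot;</elem>")
with_literals = ET.fromstring('<elem>"hello world"</elem>')

# After parsing, both trees hold identical character data ...
print(repr(with_entities.text))
print(with_entities.text == with_literals.text)

# ... so the serializer has nothing left to distinguish the two spellings.
print(ET.tostring(with_entities, encoding="unicode"))
```

Any fix therefore has to happen at serialization time, by deciding which characters to write out as entities.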

The following example shows why I need this. The XML is for the XBMC skinning parser:

>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
        <defaultcontrol>101</defaultcontrol>
        <controls>
                <control type="button" id="101">
                        <onfocus>Dialog.Close(212)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="102">
                        <visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
                        <onfocus>RunScript(script.test)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="103">
                        <visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
                        <onfocus>Close</onfocus>
                        <onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
                </control>
        </controls>
</window>

>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1])  # remove one element, just for demonstration
... 
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read()  # the file after the change
<window id="1234">
        <defaultcontrol>101</defaultcontrol>
        <controls>
                <control type="button" id="101">
                        <onfocus>Dialog.Close(212)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="102">
                        <visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
                        <onfocus>RunScript(script.test)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="103">
                        <visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
                        <onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
                </control>
        </controls>
</window>

>>> 

As you can see in the last onfocus element under id "103", the &quot; entities are no longer in their original form. This leads to a bug if the $INFO[VideoPlayer.Album] variable contains nested quotes and becomes ""test"", which is invalid.

So is there any hacky way to keep &quot; in its original form?

[UPDATE]: For anyone interested, the other three predefined XML entities (gt, lt and amp) are only converted when serializing with method="html" inside a script tag. lxml and xml.etree.ElementTree, on both Python 2 and Python 3, share this mechanism, which confuses people:

>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
  <script>"'&amp;&gt;&lt;</script>
  <p>"'&amp;&gt;&lt;</p>
</root>

>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>

>>> 
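The update above says xml.etree.ElementTree behaves the same way. A minimal Python 3 sketch to check that claim with the standard library alone might look like this (the variable names xml_out and html_out are illustrative):

```python
import xml.etree.ElementTree as ET

doc = ("<root><script>&quot;&apos;&amp;&gt;&lt;</script>"
       "<p>&quot;&apos;&amp;&gt;&lt;</p></root>")
root = ET.fromstring(doc)

# XML output: &, < and > are re-escaped everywhere; quotes stay literal.
xml_out = ET.tostring(root, encoding="unicode", method="xml")
print(xml_out)

# HTML output: the text of <script> is written raw, other elements are escaped.
html_out = ET.tostring(root, encoding="unicode", method="html")
print(html_out)
```

In both modes the quote characters come back as literal " and ', so neither output method preserves &quot; on its own.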

[UPDATE2]: The following script runs through a list of HTML tags (taken from the html5lib sanitizer):

#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']
from lxml import etree as et
for e in acceptable_elements:
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    s = et.tostring(r, pretty_print=1, method="html")
    closed_tag = "</" + e + ">"
    if closed_tag not in s:
        print s

Run this code and you will see output as following:

<area>

<br>

<col>

<hr>

<img>

<input>

As you can see, only the opening tag is printed and the rest disappears. I tested all five XML entities and they all behave the same way, which is confusing. This does not happen when parsing with HTMLParser, so I suspect a bug between the fromstring() step (which parses as XML by default) and the tostring(method="html") step. I also found that it has nothing to do with entities, because "<img>hello</img>" (without entities) is truncated to <img> as well. The text hello simply vanishes, yet it reappears whenever method="xml" is used for output.

Recommended answer

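from xml.sax.saxutils import escape, quoteattr
from lxml import etree

# Characters escaped in addition to &, < and >, which escape() always handles.
QUOTES = {"'": "&apos;", '"': "&quot;"}

def to_string(xdoc):
    """Serialize xdoc by hand so that quotes in text are written as entities."""
    r = ""
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            text = escape(elem.text, QUOTES) if elem.text is not None else ""
            # quoteattr() escapes the attribute value and wraps it in quotes
            attrs = "".join([' %s=%s' % (k, quoteattr(v)) for k, v in elem.attrib.items()])
            r += "<%s%s>%s" % (elem.tag, attrs, text)
        elif action == 'end':
            # the original appends a newline when an element has no tail text
            r += "</%s>%s" % (elem.tag, escape(elem.tail, QUOTES) if elem.tail else "\n")
    return r

# xml_text is assumed to hold the source document as a string
xdoc = etree.fromstring(xml_text)
s = to_string(xdoc)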
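For readers without lxml, roughly the same idea can be sketched in Python 3 with the standard library only. This is an illustrative variant, not the answer's code: the function name serialize and the shortened sample document are assumptions, and the sketch ignores comments, processing instructions and namespaces.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Replacements applied on top of &amp;, &lt; and &gt;, which escape() always handles.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}

def serialize(elem):
    """Serialize elem recursively, writing quotes in text as entities."""
    # quoteattr() escapes the value and adds the surrounding quotes itself.
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in elem.attrib.items())
    parts = ["<%s%s>" % (elem.tag, attrs)]
    if elem.text:
        parts.append(escape(elem.text, QUOTE_ENTITIES))
    for child in elem:
        parts.append(serialize(child))
        if child.tail:
            parts.append(escape(child.tail, QUOTE_ENTITIES))
    parts.append("</%s>" % elem.tag)
    return "".join(parts)

# Shortened version of the control element from the question.
source = ('<control type="button" id="103"><onfocus>RunScript('
          '&quot;default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;)'
          '</onfocus></control>')
root = ET.fromstring(source)
result = serialize(root)
print(result)
```

Unlike the version above, this one does not insert a newline after elements without tail text, so the original whitespace is kept as parsed. Note that every literal quote in text is written as an entity, including quotes that were literal in the source file.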
