Figuring out where CDATA is in lxml element?


Question

I need to parse and rebuild a file format used by a parser which speaks a language that can only charitably be described as XML. I realize that standards-compliant XML doesn't care about either the CDATA or the whitespace, but unfortunately this application demands that I care about both...

I'm using lxml.etree because it's pretty good at preserving CDATA.

For example:

s = '''
<root>
  <item>
     <![CDATA[whatever]]>
  </item>
</root>'''

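import lxml.etree as et

# strip_cdata=False keeps CDATA sections instead of folding them into plain text
root = et.fromstring(s, et.XMLParser(strip_cdata=False))
item = root.find('item')
print(et.tostring(item, encoding='unicode'))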

This prints:

<item>
     <![CDATA[whatever]]>
  </item>

lxml has exactly preserved the formatting of the <item> tag... great!

The problem is that I don't have any way to tell exactly where the CDATA begins and ends within the text of the tag. The property item.text gives no indication of which part of the text is wrapped in CDATA:

item.text
 ==> '\n     whatever\n  '

So if I modify it and try to spit it back out as CDATA, I lose the surrounding whitespace:

item.text = et.CDATA('foobar')
et.tostring(item, encoding='unicode')
 ==> '<item><![CDATA[foobar]]></item>\n'

Clearly, lxml "knows" where the CDATA is located within the text of a node, because et.tostring() preserves it. However, I can't figure out a way to introspect which parts of the text are CDATA and which aren't. Any advice?

Recommended answer

I'm not sure about lxml, but with minidom you can change the CDATA section and preserve the surrounding whitespace, because a CDATASection is a separate node type.

>>> from xml.dom import minidom
>>> data = minidom.parseString(s)
>>> parts = data.getElementsByTagName('item')
>>> item = parts[0]
>>> item.childNodes
[<DOM Text node "'\n     '">, <DOM CDATASection node "'whatever'">, <DOM Text node "'\n  '">]
>>> item.childNodes[1].nodeValue = 'changed'
>>> print(item.toxml())
<item>
     <![CDATA[changed]]>
  </item>

See "xml.dom.minidom: Getting CDATA values" for more details.
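
The same idea can be wrapped up as a small script. This is only a minimal sketch, assuming Python 3 and the sample document from the question; the helper names describe_children and replace_cdata are made up for illustration and are not part of minidom. It lists which child nodes of <item> are CDATA and which are plain text, then rewrites the CDATA content while leaving the surrounding whitespace nodes alone.

```python
from xml.dom import minidom

# Sample document from the question.
SAMPLE = """
<root>
  <item>
     <![CDATA[whatever]]>
  </item>
</root>"""


def describe_children(element):
    """Return (kind, text) pairs for the text and CDATA children of element."""
    parts = []
    for node in element.childNodes:
        if node.nodeType == node.CDATA_SECTION_NODE:
            parts.append(("cdata", node.data))
        elif node.nodeType == node.TEXT_NODE:
            parts.append(("text", node.data))
    return parts


def replace_cdata(element, new_value):
    """Overwrite every CDATA section that is a direct child of element."""
    for node in element.childNodes:
        if node.nodeType == node.CDATA_SECTION_NODE:
            node.data = new_value


doc = minidom.parseString(SAMPLE)
item = doc.getElementsByTagName("item")[0]

parts = describe_children(item)   # which pieces are CDATA, which are text
replace_cdata(item, "changed")    # whitespace text nodes stay untouched
result = item.toxml()
print(parts)
print(result)
```

Because the whitespace lives in its own Text nodes, changing the CDATASection leaves the indentation around it exactly as it was parsed.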
