How to prevent lxml remove method from removing text between two elements
Problem description
I'm using lxml and Python 2.7 to parse XML files. At some point I need to remove an element with the remove method, but strangely it also removes some of the text that follows the element.
The input XML is:
<ce:para view="all">Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>
Then I need to expand the cross-refs element into multiple cross-ref elements, each with its own refid. So the output should be something like this:
<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>
And here's the Python code, abbreviated:
xpath = "//ce:cross-refs"
cross_refs = tree.xpath(xpath, namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
for c in cross_refs:
    c_parent = c.getparent()
    c_values = c.text.strip("[]")
    ...
    ref_ids = c.attrib['refid'].strip().split()
    i = 0
    for r in ref_ids:
        ...
        tag = et.QName(CE, 'cross-ref')
        exploded_cross_refs = et.Element(tag, refid=r, nsmap=NS_MAP)
        exploded_cross_refs.text = "[" + c_values[i] + "]"
        c.addprevious(exploded_cross_refs)
        i += 1
    c_parent.remove(c)
This gets each cross-refs element, splits its refid values and its text values, creates new cross-ref elements, and adds them before the original cross-refs element. Finally I want to remove the old cross-refs element, and this is exactly where my problem is: when I remove this element, the text between its closing tag and the next element gets removed as well, so the final result looks like this:
<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref></ce:para>
Notice that the text between the last cross-ref element and the end of the para element has been removed! How can I fix this issue?
Recommended answer
lxml stores the text that follows an element's closing tag in that element's tail, so remove() discards it together with the element. One way around this, especially when not all elements of a certain name within a certain parent need to be removed, is a simple function that appends the tail to the previous sibling's tail, if there is one, or to the parent's text otherwise, before the element is actually removed:
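def remove_preserve_tail(element):
    # Look up the parent first so it is available even when there is no tail.
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail.
            prev.tail = (prev.tail or '') + element.tail
        else:
            # No previous sibling: append the tail to the parent's text.
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)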
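To see why the tail has to be moved before remove() is called, here is a minimal sketch of where lxml keeps the trailing text. The sample markup and variable names are made up for illustration and are not from the question.

```python
from lxml import etree

# Made-up sample markup, for illustration only.
root = etree.fromstring("<p>before <b>bold</b> after</p>")
b = root[0]

print(repr(root.text))  # text before the first child belongs to the parent
print(repr(b.text))     # text inside <b>
print(repr(b.tail))     # text after </b> is stored on <b>, not on <p>

# A plain remove() therefore drops the tail together with the element.
root.remove(b)
remaining = etree.tostring(root, encoding="unicode")
print(remaining)
```

The " after" text leaves the tree along with the b element, which is the text remove_preserve_tail moves onto the previous sibling or the parent before the element is removed.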
Demo:
>>> from lxml import etree
>>> raw = '''<root>
... foo
... <div></div>has tail and no prev
... <br/><div></div>has tail and prev
... <br/>
... <div>no tail, whitespaces only</div>
... </root>'''
>>> root = etree.fromstring(raw)
>>> divs = root.xpath("//div")
>>> for div in divs:
...     remove_preserve_tail(div)
...
>>> print etree.tostring(root)
<root>
foo
has tail and no prev
<br/>has tail and prev
<br/>

</root>
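Applied to the XML from the question, the same idea could look like the sketch below. It makes a few assumptions that are not in the original post: the labels inside the brackets are comma-separated, there are as many labels as refid values, and the namespace declaration is added to the sample so that it parses on its own. The function is repeated, with the parent looked up before the tail check, so that the block runs standalone.

```python
from lxml import etree

CE = "http://www.elsevier.com/xml/common/dtd"
NS_MAP = {"ce": CE}


def remove_preserve_tail(element):
    """Remove element from its parent but keep the text that follows it."""
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or "") + element.tail
        else:
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


# The question's input, with an xmlns:ce declaration added (assumption).
raw = (
    '<ce:para xmlns:ce="%s" view="all">Web and grid services '
    '<ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>'
    ', where they can provide rich service descriptions that can help '
    'in locating suitable services.</ce:para>' % CE
)
root = etree.fromstring(raw)

for c in root.xpath("//ce:cross-refs", namespaces=NS_MAP):
    # Assumption: labels are comma-separated, one per refid.
    labels = c.text.strip("[]").split(",")
    ref_ids = c.attrib["refid"].split()
    for ref_id, label in zip(ref_ids, labels):
        tag = etree.QName(CE, "cross-ref")
        new_ref = etree.Element(tag, refid=ref_id, nsmap=NS_MAP)
        new_ref.text = "[" + label.strip() + "]"
        c.addprevious(new_ref)
    # The last new cross-ref is now the previous sibling, so it gets the tail.
    remove_preserve_tail(c)

result = etree.tostring(root, encoding="unicode")
print(result)
```

Because the new cross-ref elements are inserted before the old element is removed, the last of them is the previous sibling and receives the ", where they can provide ..." text as its tail.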