How to prevent lxml remove method from removing text between two elements

Problem description

I'm using lxml with Python 2.7 to parse XML files. At some point I need to remove an element with the remove method, but strangely it also removes some of the text that follows the element.

The input XML is:

<ce:para view="all">Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>

I then need to expand the cross-refs element into multiple cross-ref elements, each with its own refid. The output should look something like this:

<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>

Here is the Python code, abbreviated:

xpath = "//ce:cross-refs"
cross_refs = tree.xpath(xpath, namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
for c in cross_refs:
    c_parent = c.getparent()
    c_values = c.text.strip("[]")
    ...
    ref_ids = c.attrib['refid'].strip().split()
    i = 0
    for r in ref_ids:
        ...
        tag = et.QName(CE, 'cross-ref')
        exploded_cross_refs = et.Element(tag, refid=r, nsmap=NS_MAP)
        exploded_cross_refs.text = "[" + c_values[i] + "]"
        c.addprevious(exploded_cross_refs)
        i += 1
    c_parent.remove(c)

This code finds each cross-refs element, splits its refid values and its text values, creates a new cross-ref element for each pair, and adds the new elements before the original cross-refs. Finally I want to remove the old cross-refs element, and that is exactly where my problem lies: when I remove this element, the text between its closing tag and the next element is removed as well, so the final result looks like this:

<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref></ce:para>

Notice that the text between the last cross-ref and the closing tag of the para element has been removed! How can I fix this?

Recommended answer

Alternatively, especially when not every element of a given name within a parent needs to be removed, we can write a simple helper that, before the element is actually removed, appends its tail (the text following its closing tag) to the previous sibling's tail if there is one, or to the parent's text otherwise:

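def remove_preserve_tail(element):
    # Look up the parent first so it is defined even when there is no tail.
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            # Hand the tail over to the previous sibling.
            prev.tail = (prev.tail or '') + element.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)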
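
The helper is needed because lxml stores the text that follows an element's closing tag in that element's tail attribute, so remove() takes the text along with the element. A minimal sketch of that behaviour, using a made-up sample document that is not part of the original post:

```python
from lxml import etree

# Made-up sample: " world" follows </b>, so lxml stores it as b.tail,
# while "hello " belongs to the parent as root.text.
root = etree.fromstring("<p>hello <b>bold</b> world</p>")
b = root[0]
print(repr(root.text))
print(repr(b.tail))

# remove() detaches the element together with its tail text.
root.remove(b)
print(etree.tostring(root, encoding="unicode"))
```

After the call, only the parent's own text is left, which is the same effect seen in the question.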

Demo:

>>> from lxml import etree
>>> raw = '''<root>
... foo
... <div></div>has tail and no prev
... <br/><div></div>has tail and prev
... <br/>
... <div>no tail, whitespaces only</div>
... </root>'''
>>> root = etree.fromstring(raw)
>>> divs = root.xpath("//div")
>>> for div in divs:
...     remove_preserve_tail(div)
... 
>>> print(etree.tostring(root).decode())
<root>
foo
has tail and no prev
<br/>has tail and prev
<br/>

</root>
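
Applied to the question's case, the same idea could look roughly like the sketch below. Several things here are assumptions rather than part of the original post: the xmlns:ce declaration added to the sample so that it parses on its own, the definitions of CE and NS_MAP (the question uses these names without showing them), the expand_cross_refs wrapper, and the guess that the bracketed labels are comma-separated.

```python
from lxml import etree

CE = "http://www.elsevier.com/xml/common/dtd"  # namespace URI taken from the question
NS_MAP = {"ce": CE}                            # assumed definition of NS_MAP


def remove_preserve_tail(element):
    """Remove an element but keep its tail text in the document."""
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or "") + element.tail
        else:
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


def expand_cross_refs(tree):
    """Hypothetical wrapper: split each ce:cross-refs into ce:cross-ref elements."""
    for c in tree.xpath("//ce:cross-refs", namespaces=NS_MAP):
        # Assumption: labels are comma-separated inside the brackets.
        labels = c.text.strip("[]").split(",")
        ref_ids = c.attrib["refid"].split()
        for ref_id, label in zip(ref_ids, labels):
            new = etree.Element(etree.QName(CE, "cross-ref"),
                                refid=ref_id, nsmap=NS_MAP)
            new.text = "[" + label.strip() + "]"
            c.addprevious(new)
        # The tail of the old element moves to the last new cross-ref.
        remove_preserve_tail(c)


# Sample from the question, with a namespace declaration added (assumption).
para = etree.fromstring(
    '<ce:para xmlns:ce="http://www.elsevier.com/xml/common/dtd" view="all">'
    'Web and grid services '
    '<ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>'
    ', where they can provide rich service descriptions that can help '
    'in locating suitable services.</ce:para>'
)
expand_cross_refs(para)
print(etree.tostring(para, encoding="unicode"))
```

Because the new cross-ref elements are inserted before the old one, the last of them is the previous sibling at removal time, so the trailing sentence ends up as its tail instead of being dropped.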
