How to iterate through XML data to remove the next duplicate element using lxml

Question

I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one (that is, if it has the same attributes).

Example:

From this input:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

I would like to get this output:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
</root>

To do this, I came up with the following code:

from lxml import etree
from io import StringIO


xml = '''
<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>'''

# this simulates reading the XML above from a file
# (Python 3: str is already unicode, so no unicode() call is needed)
file = StringIO(xml)

# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()

# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on the last element getnext() returns None, so ".attrib" raises "AttributeError" and the loop terminates
    try:
        # attributes of actual element
        elem_attrib_ACT = element.attrib
        # attributes of next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove next element but this approach is not working
        # if you uncomment, it removes the elements of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')

print('result:')
print(etree.tostring(root))

The commented-out line

#element.getparent().remove(element.getnext())

removes the element around "data2", but then the loop stops. The resulting XML is:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

My impression is that I am "cutting off the branch I am sitting on"...

Any suggestions on how to solve this?

Recommended answer

I think your suspicion is correct. If you put a print statement before the break in the except block, you can see that the loop exits early because this element has already been removed (I think):

<b attrib1="abc" attrib2="def">
    <c>data2</c>
</b>
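
The effect can be sketched in isolation. This is a minimal illustration, assuming lxml is installed; the two-element document is made up for the example. Once an element is removed from its parent it is detached, so `getnext()` on it returns `None`, and accessing `.attrib` on that `None` raises the `AttributeError` that the question's loop treats as "last element reached".

```python
from lxml import etree

# Hypothetical two-element document, used only for illustration
root = etree.fromstring('<root><b id="1"/><b id="2"/></root>')
first, second = root[0], root[1]

# Remove the second element while still holding a reference to it
root.remove(second)

# The detached element no longer has a parent or siblings
print(second.getparent())  # None
print(second.getnext())    # None

# This is the same failure the try/except in the question catches
try:
    second.getnext().attrib
except AttributeError as exc:
    print('AttributeError:', exc)
```

If the iterator hands out such a detached element on the next step, the loop breaks even though more "b" elements remain in the tree, which matches the behavior described in the question.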

Try using getprevious() instead of getnext(), so that you remove the current element rather than the one the iterator is about to visit. I also collect the elements into a list first and skip the first one with [1:], since it has no previous sibling and .getprevious() would return None for it:

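# snapshot the "b" elements in a list so that removals do not disturb the iteration,
# and skip the first one because it has no previous sibling
for element in list(root.iter('b'))[1:]:
    previous = element.getprevious()
    # getprevious() returns None when there is no previous sibling
    if previous is not None and previous.attrib == element.attrib:
        element.getparent().remove(element)
print(etree.tostring(root).decode())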

Result:

<root>
<b attrib1="abc" attrib2="def">
    <c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
    <c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
    <c>data4</c>
</b>
</root>
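
For reference, the same idea can be wrapped in a small self-contained Python 3 function. This is only a sketch under assumptions: the name `remove_consecutive_duplicates` is made up here, and "duplicate" is taken to mean "same tag and same attributes as the previous sibling", as in the question.

```python
from lxml import etree


def remove_consecutive_duplicates(root, tag):
    """Remove each `tag` element whose attributes equal those of its previous sibling."""
    # Take a snapshot first, because the tree is modified while we walk the list
    for element in list(root.iter(tag)):
        previous = element.getprevious()
        # No previous sibling, or a sibling of a different kind: nothing to compare
        if previous is None or previous.tag != tag:
            continue
        if dict(previous.attrib) == dict(element.attrib):
            element.getparent().remove(element)
    return root


xml = '''<root>
    <b attrib1="abc" attrib2="def"><c>data1</c></b>
    <b attrib1="abc" attrib2="def"><c>data2</c></b>
    <b attrib1="uvw" attrib2="xyz"><c>data3</c></b>
    <b attrib1="abc" attrib2="def"><c>data4</c></b>
    <b attrib1="abc" attrib2="def"><c>data5</c></b>
    <b attrib1="abc" attrib2="def"><c>data6</c></b>
</root>'''

root = etree.fromstring(xml)
remove_consecutive_duplicates(root, 'b')

# Collect the text of the surviving <c> children to check the result
remaining = [b.findtext('c') for b in root.iter('b')]
print(remaining)  # ['data1', 'data3', 'data4']
```

Because each element is compared with whatever sibling currently precedes it, a run of three identical elements ("data4", "data5", "data6") collapses to the first one. Note that lxml's remove() also takes the removed element's tail text with it, which only affects whitespace in this example.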
