Python从xml树中删除重复元素 [英] Python remove duplicate elements from xml tree

查看:48
本文介绍了Python从xml树中删除重复元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 xml 结构,其中包含一些不唯一的元素.所以我设法对子树进行排序,并且可以过滤我不止一次的元素.但是remove功能好像不适用.

I have a xml structure with some elements which are not unique. So I managed to sort the subtrees and I can filter propper the elements which I have more than one time. But the remove function seems not to apply.

我的 XML 结构看起来像这样简化:

My XML Structure looks simplified like this:

<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text><!-- line should be removed -->
    <text>2nd blabla blub again unique</text>
  </page>
</root>

我想删除每个页面上的双字符串,所以我在两个 for 循环中遍历页面和页面中的元素:(重要行的摘录,我希望没有忘记任何东西)

I want to remove double strings on each page, so I'm iterating over pages and over elements in page in two for loops: (extract of important lines, I hope didn't forget anything)

import xml.etree.ElementTree as ET
self.tree = ET.parse(path)
self.root = self.tree.getroot()
self.prev = None
# [...]
for page in self.root:                     # iterate over pages
    for elem in page:
        if elements_equal(elem, self.prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            page.remove(elem) # <---- removes just one line
            continue
        self.prev = elem
# [...]
self.tree.write("out.xml") # 2 duplicate lines still there....

更新:代码似乎有效,但它只删除了一个重复项,而不是全部

update: The code seems to work, but it removes just one duplicate, not all

推荐答案

我不知道你是如何定义 elements_equal 的,但是(无耻地改编自 测试 xml.etree.ElementTree 的等效性) 这对我有用:

I don't know how you've defined elements_equal, but (shamelessly adapted from Testing Equivalence of xml.etree.ElementTree) this works for me:

在迭代 page 时存储要删除的每个元素的列表,然后删除它们而不是在一个循环中进行删除.

store a list of each element to be removed whilst iterating over page and then remove them rather than doing the removal within one loop.

在元素标签的比较中注意到代码中的一个小错字并更正.

Noticed a small typo in the code in the comparison of the element tags and correct it.

import xml.etree.ElementTree as ET

path = 'in.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")

给出:

$ python undupe.py
found duplicate: blabla blub not unique
found duplicate: 2nd blabla blub not unique
$ cat out.xml
<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub again unique</text>
  </page>

这篇关于Python从xml树中删除重复元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆