How to remove duplicate nodes in XML with Python


Question

I have an XML file with a special-case structure, something like this:

<Root>
    <parent1>
        <parent2>
            <element id="Something"/>
        </parent2>
    </parent1>
    <parent1>
        <element id="Something"/>
    </parent1>
</Root>

My use case is to remove duplicated elements: I want to remove elements that have the same id. I tried the following code without success (it does not find the duplicate node):

import xml.etree.ElementTree as ET

path = 'old.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    elems_to_remove = []
    for elem in page:
        for insideelem in page:
            if elements_equal(elem, insideelem) and elem != insideelem:
                print("found duplicate: %s" % insideelem.text)   # equal function works well
                elems_to_remove.append(insideelem)
                continue

    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")

Can someone help me figure out how to solve this? I am very new to Python, with almost zero experience.

Recommended answer

First of all, what you're trying to do is a hard problem in the library you're using; see this question: How to remove a node inside an iterator in python xml.etree.ElemenTree

The solution is to use lxml, which "implements the same API but with additional enhancements". Then you can apply the following fix.

You seem to be traversing only the second level of nodes in your XML tree. You get root, then walk the children of its children. That gives you parent2 from the first page and element from the second page. Furthermore, you aren't comparing across pages: your comparison would only find second-level duplicates within the same page.
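To make the traversal problem concrete, here is a small sketch that runs against a sample document modelled on the one in the question (the element tags are self-closed so that it parses; the variable names are illustrative). It only uses the standard library and compares what the nested loops visit with what a recursive traversal visits.

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the question's structure (assumed well-formed).
xml_text = """
<Root>
    <parent1>
        <parent2>
            <element id="Something"/>
        </parent2>
    </parent1>
    <parent1>
        <element id="Something"/>
    </parent1>
</Root>
"""
root = ET.fromstring(xml_text)

# What the nested loops in the question visit: the children of each top-level child.
second_level = [child.tag for page in root for child in page]

# What a recursive traversal visits: every <element>, at any depth.
all_elements = [el.get("id") for el in root.iter("element")]

print(second_level)
print(all_elements)
```

The first list mixes a parent2 node with an element node, and each of them lives in a different page, so the pairwise comparison inside a single page never sees both elements together.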

Select the right set of elements using a proper traversal function such as iter:

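# Assumes `root` was parsed with lxml (e.g. `from lxml import etree`),
# because getparent() is an lxml extension that the standard library lacks.
# Use a `set` to keep track of "visited" ids with good lookup time.
visited = set()
# The iter method does a recursive traversal; list() takes a snapshot
# so that removing elements does not disturb the traversal.
for el in list(root.iter('element')):
    # Since the id is what defines a duplicate for you
    if 'id' in el.attrib:
        current = el.get('id')
        # In visited already means it's a duplicate, remove it
        if current in visited:
            el.getparent().remove(el)
        # Otherwise mark this ID as "visited"
        else:
            visited.add(current)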
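If installing lxml is not an option, roughly the same idea can be sketched with the standard library alone. This is an alternative to the answer's approach, not part of it: since xml.etree.ElementTree elements have no getparent(), the sketch builds a child-to-parent map first. The helper name remove_duplicate_ids and its tag parameter are made up for illustration.

```python
import xml.etree.ElementTree as ET

def remove_duplicate_ids(root, tag="element"):
    """Remove every `tag` element whose id was already seen; return how many were removed."""
    # ElementTree elements don't know their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    visited = set()
    removed = 0
    # Take a snapshot of the matches so removals don't disturb the traversal.
    for el in list(root.iter(tag)):
        current = el.get("id")
        if current is None:
            continue  # elements without an id are left alone
        if current in visited:
            parent_of[el].remove(el)
            removed += 1
        else:
            visited.add(current)
    return removed

# Sample input modelled on the question's structure.
root = ET.fromstring(
    '<Root>'
    '<parent1><parent2><element id="Something"/></parent2></parent1>'
    '<parent1><element id="Something"/></parent1>'
    '</Root>'
)
removed = remove_duplicate_ids(root)
remaining = [el.get("id") for el in root.iter("element")]
print(removed, remaining)
```

The first occurrence in document order is kept and later ones are dropped. With a file on disk, the same function could be applied to tree.getroot() before calling tree.write("out.xml"), as in the question.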
