How to remove duplicate nodes in XML with Python
Problem description
I have a special-case XML file whose structure looks something like this:
<Root>
  <parent1>
    <parent2>
      <element id="Something"/>
    </parent2>
  </parent1>
  <parent1>
    <element id="Something"/>
  </parent1>
</Root>
My use case is to remove duplicated elements: I want to remove the elements that share the same id. I tried the following code without success (it does not find the duplicate node):
import xml.etree.ElementTree as ET

path = 'old.xml'
tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:  # iterate over pages
    elems_to_remove = []
    for elem in page:
        for insideelem in page:
            if elements_equal(elem, insideelem) and elem != insideelem:
                print("found duplicate: %s" % insideelem.text)  # equal function works well
                elems_to_remove.append(insideelem)
                continue
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")
Can someone help me work out how to solve this? I am very new to Python, with almost zero experience.
Recommended answer
First of all, what you're doing is a hard problem in the library you're using; see this question: How to remove a node inside an iterator in python xml.etree.ElementTree
The solution is to use lxml, which "implements the same API but with additional enhancements". In particular, lxml elements provide getparent(), which xml.etree.ElementTree elements lack. Then you can apply the following fix.
You seem to be traversing only the second level of nodes in your XML tree. You get root, then walk the children of its children. This gives you parent2 from the first page and element from the second page. Furthermore, you are not comparing across pages here: your comparison will only find second-level duplicates within the same page.
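To make the traversal problem concrete, here is a small sketch using only the standard library. The SAMPLE string is a hypothetical stand-in for the file in the question (with the element tags self-closed so that it parses), and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = """
<Root>
  <parent1>
    <parent2>
      <element id="Something"/>
    </parent2>
  </parent1>
  <parent1>
    <element id="Something"/>
  </parent1>
</Root>
"""

root = ET.fromstring(SAMPLE)

# Two nested loops only reach the grandchildren of the root.
second_level = [elem.tag for page in root for elem in page]

# iter() walks the whole tree, whatever the nesting depth.
all_element_ids = [el.get("id") for el in root.iter("element")]

print(second_level)     # ['parent2', 'element']
print(all_element_ids)  # ['Something', 'Something']
```

The nested loops never see the element inside parent2, and each inner loop looks at a single parent1 at a time, so the two elements are never compared with each other.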
Select the right set of elements using a proper traversal function such as iter:
# Use a `set` to keep track of "visited" elements with good lookup time.
visited = set()
# The iter method does a recursive traversal. Wrapping it in list() takes a
# snapshot first, so removing elements does not disturb the traversal.
for el in list(root.iter('element')):
    # Since the id is what defines a duplicate for you
    if 'id' in el.attrib:
        current = el.get('id')
        # In visited already means it's a duplicate, remove it
        if current in visited:
            # getparent() is provided by lxml, not by xml.etree.ElementTree
            el.getparent().remove(el)
        # Otherwise mark this ID as "visited"
        else:
            visited.add(current)
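If installing lxml is not an option, the same idea can probably be approximated with the standard library by building a child-to-parent lookup up front. This is only a sketch under that assumption, not the approach recommended above: remove_duplicate_ids, parent_of, and SAMPLE are hypothetical names introduced for illustration, and SAMPLE again stands in for the file from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = """
<Root>
  <parent1>
    <parent2>
      <element id="Something"/>
    </parent2>
  </parent1>
  <parent1>
    <element id="Something"/>
  </parent1>
</Root>
"""

def remove_duplicate_ids(root, tag="element"):
    """Remove each `tag` element whose id was already seen; return the count."""
    # ElementTree has no getparent(), so build a child -> parent lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    visited = set()
    removed = 0
    # Snapshot the matches so removals do not disturb the traversal.
    for el in list(root.iter(tag)):
        el_id = el.get("id")
        if el_id is None:
            continue  # elements without an id are left alone
        if el_id in visited:
            parent_of[el].remove(el)
            removed += 1
        else:
            visited.add(el_id)
    return removed

root = ET.fromstring(SAMPLE)
removed = remove_duplicate_ids(root)
remaining = [el.get("id") for el in root.iter("element")]
print(removed, remaining)  # 1 ['Something']
```

The first element in document order is kept and later ones with the same id are dropped; the second parent1 stays in the tree as an empty node, which may or may not be what you want.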