Check and remove duplicated children tags in XML
Problem description
I'm parsing an XML-like file with ElementTree in Python and writing the content to a pandas dataframe.
I'm currently facing the following problem: which child tags exist varies from tag to tag. That alone wouldn't be a problem with the solution mentioned here. The complicated part is that some tags have duplicated child tags while others don't. For example, the first product tag has two (different) article numbers and two equal product_types (a duplicate), while the second has only one of each.
<main>
  <product>
    <article_nr>B00024J7C6</article_nr>
    <article_nr>44253</article_nr>
    <product_type>x</product_type>
    <product_type>x</product_type>
  </product>
  <product>
    <article_nr>B00024J7C7</article_nr>
    <product_type>y</product_type>
  </product>
</main>
What I'd like to do is: 1) remove the duplicates for 'product_type', and 2) set the value to NULL if a second article_nr doesn't exist, otherwise take its value.
My code so far:
import pandas as pd

def create_dataframe(data):
    # Assumes every <product> has exactly four children in this order.
    columns = ['article_nr', 'article_nr2', 'product_type', 'product_type2']
    rows = []
    for product in data:
        obj = list(product)  # child elements of this <product>
        # Positional lookup: raises IndexError when a child is missing.
        rows.append(dict(zip(columns, [obj[0].text, obj[1].text, obj[2].text, obj[3].text])))
    return pd.DataFrame(rows, columns=columns)
This works fine with the first example, but obviously not with the second, because there are no values for the second 'article_nr' and 'product_type'.
The output should be:
article_nr   article_nr2  product_type
B00024J7C6   44253        x
B00024J7C7   NULL         y
Recommended answer
Look at "Python remove duplicate elements from xml tree"; maybe it can help you. Something like this:
import xml.etree.ElementTree as ET

path = 'in.xml'
tree = ET.parse(path)
root = tree.getroot()


def elements_equal(e1, e2):
    # Recursively compare tag, text, tail, attributes and children.
    if type(e1) != type(e2):  # also covers prev still being None
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))


for page in root:  # iterate over the <product> elements
    prev = None  # reset so a comparison never crosses into the previous parent
    elems_to_remove = []
    # Only adjacent duplicates are detected: each child is compared
    # with the last child that was kept.
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)
            continue
        prev = elem
    # Remove after iterating, so the parent is not modified mid-loop.
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)

tree.write("out.xml")
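The snippet above only covers removing the duplicates. For the second part of the question (a NULL when no second article_nr exists), one possible sketch is to collect the distinct values per tag and pad the missing slots with None. The helper name `product_to_row` and the `SLOTS` layout are hypothetical, chosen here for illustration; the XML is the sample from the question, inlined so the example is self-contained.

```python
import xml.etree.ElementTree as ET

XML = """<main>
  <product>
    <article_nr>B00024J7C6</article_nr>
    <article_nr>44253</article_nr>
    <product_type>x</product_type>
    <product_type>x</product_type>
  </product>
  <product>
    <article_nr>B00024J7C7</article_nr>
    <product_type>y</product_type>
  </product>
</main>"""

# Assumed layout: how many columns to reserve for each tag.
SLOTS = (("article_nr", 2), ("product_type", 1))


def product_to_row(product):
    """Hypothetical helper: flatten one <product> into a dict of columns."""
    values = {}
    for child in product:
        seen = values.setdefault(child.tag, [])
        # Keep each distinct text only once per tag, which drops duplicates.
        if child.text not in seen:
            seen.append(child.text)
    row = {}
    for tag, count in SLOTS:
        found = values.get(tag, [])
        for i in range(count):
            # First slot keeps the tag name, later slots get a numeric suffix.
            column = tag if i == 0 else "%s%d" % (tag, i + 1)
            # A missing value becomes None instead of raising IndexError.
            row[column] = found[i] if i < len(found) else None
    return row


root = ET.fromstring(XML)
rows = [product_to_row(product) for product in root]
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per product, with the missing second article number left empty. Unlike the adjacent-element comparison above, this compares only the text of children that share a tag, so it would need extending if attributes or nested children matter.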