Check and remove duplicated children tags in XML

Problem description

I'm parsing an XML-like file via ElementTree in Python and writing the content to a pandas DataFrame.

I'm currently facing the following problem: which child tags are present varies from one tag to the next. That alone wouldn't be a problem with the solution mentioned here. The complicated part is that some tags have duplicated child tags while others don't. For example, the first product tag has two (different) article numbers and two identical product_type values (a duplicate), while the second has only one of each.

<main>
    <product>
       <article_nr>B00024J7C6</article_nr>
       <article_nr>44253</article_nr>
       <product_type>x</product_type>
       <product_type>x</product_type>
    </product>

    <product>
       <article_nr>B00024J7C7</article_nr>
       <product_type>y</product_type>
    </product>
</main>

What I'd like to do is: 1) remove the duplicates for 'product_type', and 2) set the value to NULL if there is no second article_nr, otherwise take its value.

My code so far:

def create_dataframe(data):
    df = pd.DataFrame(columns=('article_nr', 'article_nr2', 'product_type', 'product_type2','product_type2'))
    for i in range(len(data)):
        obj = data.getchildren()[i].getchildren()
        row = dict(itertools.izip(['article_nr', 'article_nr2', 'product_type', 'product_type2','product_type2'], 
                       [obj[0].text, obj[1].text, obj[2].text, obj[3].text, obj[4].text]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
    return df

This works fine with the first example, but obviously not with the second, because there are no values for the second 'article_nr' and 'product_type'.

The output should be:

article_nr    article_nr2    product_type
B00024J7C6    44253          x
B00024J7C7    NULL           y

Recommended answer

Take a look at "Python remove duplicate elements from xml tree"; it may help. Something like this:

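import xml.etree.ElementTree as ET

path = 'in.xml'
tree = ET.parse(path)
root = tree.getroot()

def elements_equal(e1, e2):
    # Recursively compare tag, text, tail, attributes and children.
    # Text and tail are stripped so that indentation differences
    # (e.g. the last child of a parent) do not defeat the match.
    if e1 is None or e2 is None:
        return False
    if e1.tag != e2.tag: return False
    if (e1.text or '').strip() != (e2.text or '').strip(): return False
    if (e1.tail or '').strip() != (e2.tail or '').strip(): return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))

for product in root:                  # iterate over <product> elements
    prev = None                       # reset so siblings of different products are never compared
    elems_to_remove = []
    for elem in product:
        # Only adjacent duplicates are detected, because each element
        # is compared with the last one that was kept.
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        product.remove(elem_to_remove)
tree.write("out.xml")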
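The snippet above only removes duplicates from the tree; it does not yet produce the table asked for in the question. As a rough sketch of that second step, the code below flattens each product into one row, dropping repeated values and filling a missing second article number with None. The helper name `product_to_row` and the column name `article_nr2` are assumptions made for illustration, and the XML is inlined from the question so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question.
XML = """<main>
    <product>
       <article_nr>B00024J7C6</article_nr>
       <article_nr>44253</article_nr>
       <product_type>x</product_type>
       <product_type>x</product_type>
    </product>
    <product>
       <article_nr>B00024J7C7</article_nr>
       <product_type>y</product_type>
    </product>
</main>"""


def product_to_row(product):
    """Flatten one <product> element into a dict (hypothetical helper)."""
    # Collect the distinct text values per tag, keeping first-seen order.
    values = {}
    for child in product:
        seen = values.setdefault(child.tag, [])
        text = (child.text or "").strip()
        if text not in seen:
            seen.append(text)

    article_nrs = values.get("article_nr", [])
    product_types = values.get("product_type", [])
    return {
        "article_nr": article_nrs[0] if article_nrs else None,
        # None stands in for NULL when there is no second article number.
        "article_nr2": article_nrs[1] if len(article_nrs) > 1 else None,
        "product_type": product_types[0] if product_types else None,
    }


root = ET.fromstring(XML)
rows = [product_to_row(product) for product in root]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per product with the three columns. Building the list first and constructing the frame once also avoids appending to a DataFrame inside the loop.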
