ElementTree iterparse strategy

Question

I have to handle large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).

My concern is the following. Imagine you have an XML document like this:

<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>

The problem, of course, is knowing when I am reading a family name (such as Simpson) and when I am reading the name of a family member (for example, Homer).

So far I have been using "switches" that tell me whether I am inside a "members" tag or not. The code looks like this:

import xml.etree.ElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"

# "switch" that records whether we are currently inside a <members> element
on_members_tag = False
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    tag = elem.tag

    if event == 'start':
        if tag == "members":
            on_members_tag = True

    elif event == 'end':
        # elem.text is only guaranteed to be complete at the "end" event
        if tag == 'name':
            value = (elem.text or '').strip()
            if on_members_tag:
                print("The member of the family is %s" % value)
            else:
                print("The family is %s " % value)
        elif tag == 'members':
            on_members_tag = False
        # release the element's contents once it has been fully processed
        elem.clear()

This works fine; the output is:

The family is Simpson 
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin 
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg

My concern is that even for this simple example I had to create an extra variable (on_members_tag) to know which tag I was in. The real XML documents I have to handle have many more nested tags.

Also note that this is a heavily reduced example. The real XML may have more tags and deeper nesting, and I may need to extract different tag names, attributes and so on.

So the question is: am I doing something horribly stupid here? I feel there must be a more elegant solution to this.

Recommended answer

Here's one possible approach: we maintain a list of the tags on the current path and look back through it to find the parent node(s).

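import xml.etree.ElementTree as ET

file_path = "test.xml"  # same file as in the question

path = []  # tags of the currently open elements, outermost first
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag; its text is complete at this point
        if elem.tag == 'name':
            if 'members' in path:
                print('member')
            else:
                print('nonmember')
        path.pop()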
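
For deeper nesting, it may help to compare the tail of the path rather than test for membership, so that each (parent, tag) pair can be handled separately. The following is a minimal sketch of that idea under some assumptions: the function name iter_names and the inline SAMPLE document are made up for illustration, and clearing each finished family element assumes nothing inside it is needed afterwards.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the document from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>
"""


def iter_names(source):
    """Yield (kind, text) pairs, where kind is 'family' or 'member'."""
    path = []  # tags of the currently open elements, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # "end" event: elem.text is complete and path[-1] == elem.tag
        if path[-2:] == ["family", "name"]:
            yield "family", (elem.text or "").strip()
        elif path[-2:] == ["members", "name"]:
            yield "member", (elem.text or "").strip()
        path.pop()
        if elem.tag == "family":
            # Assumption: a finished <family> is not needed again, so its
            # subtree is released to keep memory use low on large files.
            elem.clear()


results = list(iter_names(io.BytesIO(SAMPLE)))
for kind, text in results:
    print(kind, text)
```

Because the check looks at the last two entries of the path, adding another element that contains name children only needs one more branch, not another flag variable.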
