Efficient way to iterate through XML elements

Problem description

I have an XML document like this:

<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>

I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document. So I use XPath to handle that:

from lxml import etree

doc = etree.fromstring(xml)

atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print(b)

It works, but my files are pretty big, and cProfile shows that the xpath calls are very expensive.

Is there a more efficient way to iterate through an unknown number of XML elements?

Recommended answer

XPath should be fast. You can reduce the number of XPath calls to one:

doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)
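
As a self-contained illustration of the single-call approach, here is a minimal sketch. It assumes the sample data is wrapped in a hypothetical <root> element (the fragment in the question has several top-level elements, which a parser would reject as is). It also shows etree.XPath, which precompiles an expression and may help when the same query is evaluated many times.

```python
from lxml import etree

# Assumption: the question's fragment is wrapped in a <root> element so that
# it forms a single well-formed document.
xml = """<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""

doc = etree.fromstring(xml)

# One XPath call selects every <b> that is a direct child of an <a>.
b_texts = [b.text for b in doc.xpath('//a/b')]

# Optional: compile the expression once and reuse it as a callable.
find_b = etree.XPath('//a/b')
b_texts_compiled = [b.text for b in find_b(doc)]

print(b_texts)
```

Both lists contain the same elements in document order; whether the compiled form is measurably faster depends on the workload, so it is worth profiling.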

If that is not fast enough, you could try Liza Daly's fast_iter. It has the advantage of not requiring the entire XML to be parsed with etree.fromstring first, and parent nodes are discarded after their children have been visited. Both of these help reduce memory requirements. Below is a modified version of fast_iter that is more aggressive about removing other elements that are no longer needed.

import io

from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

# io.BytesIO requires xml to be a bytes object here
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
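
To try this end to end, the following sketch repeats fast_iter and runs it on the question's sample data. The <root> wrapper, the collect_text callback, and the texts list are assumptions added for illustration. Because tag='b' matches every <b> regardless of its parent, the callback checks the parent itself to mimic the //a/b selection.

```python
import io

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    # Same logic as the version above, repeated so this sketch runs on its own.
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


# Assumption: the fragment is wrapped in <root> and supplied as bytes.
xml = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""

texts = []


def collect_text(elt, out):
    # Hypothetical callback: keep only <b> elements whose parent is <a>.
    parent = elt.getparent()
    if parent is not None and parent.tag == 'a':
        out.append(elt.text)


context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, collect_text, texts)
print(texts)
```

The callback must read everything it needs from the element before returning, since fast_iter clears the element immediately afterwards.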

Liza Daly's article on parsing large XML files may also prove useful reading. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse (see Table 1 in the article).
