Using lxml and iterparse() to parse a big (+- 1 GB) XML file


Question

I have to parse a 1 GB XML file with a structure such as the one below and extract the text within the "Author" and "Content" tags:

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: i) reading the whole file and going through it with .find(xmltag), and ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. I haven't managed to get the second option off the ground.

This is part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print(element.text)
    else:
        print('Finished')

The result is only blank space, with no text in it.

I must be doing something wrong, but I can't figure out what. Also, in case it wasn't obvious, I am quite new to Python and this is the first time I'm using lxml. Please help!

Recommended answer

from lxml import etree

# path_to_file is the path to the XML file, as in the question
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final element.clear() discards each BlogPost's contents once you are done with it, which stops you from using too much memory.
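For context, here is a small sketch of why the loop in the question printed only blanks. On an element, .text holds just the characters between its opening tag and its first child, so for a BlogPost that is a newline plus indentation; the readable text sits on the children. The SAMPLE string is a shortened, made-up copy of the question's structure, and the fallback import is only an assumption to let the sketch run where lxml is not installed (fromstring, find and .text behave the same way in both libraries here).

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Shortened, hypothetical sample mirroring the structure in the question
SAMPLE = (
    "<Database>\n"
    "    <BlogPost>\n"
    "        <Date>MM/DD/YY</Date>\n"
    "        <Author>Last Name, Name</Author>\n"
    "        <Content>Lorem ipsum</Content>\n"
    "    </BlogPost>\n"
    "</Database>"
)

root = etree.fromstring(SAMPLE)
post = root.find("BlogPost")

# .text is only what sits between <BlogPost> and its first child:
# here a newline followed by indentation.
post_text = post.text
is_blank = post_text.strip() == ""

# The readable text belongs to the child elements.
child_texts = {child.tag: child.text for child in post}

print(repr(post_text), is_blank)
print(child_texts)
```

So printing element.text for a BlogPost shows whitespace, while iterating over its children reaches the Date, Author and Content text.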

[Update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # encoding="unicode" makes tostring() return str rather than bytes on Python 3
    print(etree.tostring(element, encoding="unicode"))
    element.clear()

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join(etree.tostring(child, encoding="unicode") for child in element))
    element.clear()

or even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # child.text is None for an empty element, so substitute ''
    print(''.join(child.text or '' for child in element))
    element.clear()
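Since the question only needs the "Author" and "Content" text, the same loop can be narrowed with findtext(). What follows is a minimal sketch, not a definitive implementation: iter_posts and the in-memory sample are hypothetical names and data introduced for illustration, and the fallback import is an assumption so the sketch runs without lxml. It filters on element.tag inside the loop, which works in both libraries.

```python
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree


def iter_posts(source):
    """Yield (author, content) for each BlogPost, then release its contents."""
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag != "BlogPost":
            continue
        # findtext() returns None when the child element is missing
        yield element.findtext("Author"), element.findtext("Content")
        # Drop the children and text of the post that was just handled
        element.clear()


# Small in-memory stand-in for the real 1 GB file (hypothetical data)
sample = io.BytesIO(
    b"<Database>"
    b"<BlogPost><Date>01/02/03</Date><Author>Doe, Jane</Author>"
    b"<Content>First post</Content></BlogPost>"
    b"<BlogPost><Date>04/05/06</Date><Author>Roe, Richard</Author>"
    b"<Content>Second post</Content></BlogPost>"
    b"</Database>"
)

posts = list(iter_posts(sample))
for author, content in posts:
    print(author, "|", content)
```

With lxml you can pass tag="BlogPost" to iterparse(), as in the loops above, instead of checking element.tag yourself. Note that clear() empties each BlogPost but leaves the empty element attached to the root, so for very large files some memory is still retained per post.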
