Using lxml and iterparse() to parse a big (+- 1Gb) XML file


Problem description

I have to parse a 1Gb XML file with the structure shown below and extract the text inside the "Author" and "Content" tags:

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: i) reading the whole file and going through it with .find(xmltag), and ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. I haven't managed to get the second option off the ground.

Here's part of what I have:

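from lxml import etree

# The asker's attempt, shown with Python 3 print calls; behavior is unchanged.
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print(element.text)
    else:
        print('Finished')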

The result is only blank spaces, with no text in them.

I must be doing something wrong, but I can't work out what. Also, in case it wasn't obvious enough, I am quite new to Python and this is the first time I'm using lxml. Please help!

Recommended answer

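from lxml import etree

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  # The text lives in the children (Date, Author, Content), not in BlogPost itself.
  for child in element:
    print(child.tag, child.text)
  # Clear only after all children have been read.
  element.clear()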

The final clear() releases each BlogPost element once it has been handled, which stops you from using too much memory as the file is read.
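As a minimal sketch of how this could be applied to the original goal (only the "Author" and "Content" text), the generator below yields one pair per post. The function name `iter_author_content` and the sample data are hypothetical and exist only for illustration. The text is read from the children because `element.text` on `<BlogPost>` holds just the whitespace before its first child. Deleting already-processed siblings is an optional extra step, not part of the answer above, that also drops the emptied elements from the root.

```python
import io
from lxml import etree


def iter_author_content(source):
    """Yield (author, content) for each <BlogPost>, one post at a time.

    `source` may be a file path or a binary file-like object.
    """
    for _event, post in etree.iterparse(source, events=("end",), tag="BlogPost"):
        yield post.findtext("Author"), post.findtext("Content")
        # Free the children and text of the post that was just handled.
        post.clear()
        # Optional: remove earlier, already-cleared siblings from the root.
        while post.getprevious() is not None:
            del post.getparent()[0]


# Hypothetical in-memory sample standing in for the 1Gb file.
sample = io.BytesIO(
    b"<Database>"
    b"<BlogPost><Date>01/02/03</Date><Author>Doe, Jane</Author>"
    b"<Content>First post.</Content></BlogPost>"
    b"<BlogPost><Date>04/05/06</Date><Author>Roe, Richard</Author>"
    b"<Content>Second post.</Content></BlogPost>"
    b"</Database>"
)
posts = list(iter_author_content(sample))
```

For the real file, the same function would be called with `path_to_file` and consumed in a loop rather than collected into a list, so that only one post is held in memory at a time.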

[update:] to get "everything between ... as a string" I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(etree.tostring(element))
  element.clear()

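for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  # encoding="unicode" makes tostring() return str instead of bytes, so join() works.
  print(''.join([etree.tostring(child, encoding="unicode") for child in element]))
  element.clear()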

or even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  # child.text is None for an empty tag, so fall back to an empty string.
  print(''.join([child.text or '' for child in element]))
  element.clear()
