Parsing Large XML file with Python lxml and Iterparse


Question

I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items.

My file has the following format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>

So far, my solution is:

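from lxml import etree

# MYFILE is the path to the XML file; only "end" events are reported by default
context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    # the sample file uses <desc>, so query that tag
    print(elem.xpath('desc/text()'))
    # free the element and any already-processed siblings to keep memory low
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context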

When I run it, I get something similar to:

[]
['description1']
[]
['description2']

The empty lists appear because the parser also yields the item tags that are children of the url tag, and those obviously have no description field for xpath to extract. My hope was to parse out the items one by one and then process their child fields as required. I'm just learning the lxml library, so I'm curious whether there is a way to pull out the main items while leaving any nested sub-items alone.

Recommended answer

The core implementation parses the entire XML anyway. etree.iterparse is just a generator-style view over it that provides simple filtering by tag name (see the docstring: http://lxml.de/api/lxml.etree.iterparse-class.html). If you want more complex filtering, you have to implement it yourself.
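
As a rough illustration of doing the filtering yourself, the sketch below keeps the tag="item" filter but skips any item whose parent is a url element. The SAMPLE data, the enclosing items root element (added so the XML is well-formed), and the titles list are assumptions made for this example, not part of the original question.

```python
import io
from lxml import etree

# Hypothetical sample; wrapped in an assumed <items> root so it is well-formed.
SAMPLE = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc>
  <url><item>http://www.url1.com</item></url></item>
<item><title>Item 2</title><desc>Description 2</desc>
  <url><item>http://www.url2.com</item></url></item>
</items>"""

titles = []
# Default events are ("end",), so each <item> arrives once it is fully parsed.
for event, elem in etree.iterparse(io.BytesIO(SAMPLE), tag="item"):
    parent = elem.getparent()
    # The tag filter matches every <item>; skip those nested inside <url>.
    if parent is not None and parent.tag == "url":
        continue
    titles.append(elem.findtext("title"))
    # Only clear top-level items, after they have been processed.
    elem.clear()

print(titles)
```

Note that the nested items are skipped without being cleared, so their text is still available when the enclosing item is processed.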

One solution is to register for the start event as well:

iterparse(self, source, events=("start", "end",), tag="item")

Then use a boolean flag to tell whether an "end" event belongs to a top-level "item" or to a nested "item/url/item".
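
A minimal sketch of this idea follows. It uses a depth counter rather than a single boolean (a small generalization, so deeper nesting would also work), and it assumes the items are wrapped in a root element named items, since lxml requires well-formed XML. SAMPLE and iter_top_level_items are hypothetical names introduced for the example.

```python
import io
from lxml import etree

# Hypothetical sample data; the <items> root element is an assumption.
SAMPLE = b"""<items>
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>
</items>"""


def iter_top_level_items(source, tag="item"):
    """Yield only the outermost <tag> elements, skipping nested ones."""
    depth = 0  # number of currently open <tag> elements
    for event, elem in etree.iterparse(source, events=("start", "end"), tag=tag):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:
            # This "end" event closes a top-level item.
            yield elem
            # Free memory once the caller is done with the element.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]


results = [
    (item.findtext("desc"), item.findtext("url/item"))
    for item in iter_top_level_items(io.BytesIO(SAMPLE))
]
print(results)
```

Because nested items are never cleared on their own "end" event, the URL text is still present when the enclosing item is yielded.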
