Parse large XML with lxml


Problem description

I am trying to get my script working. So far it hasn't managed to output anything.

This is my test.xml:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">
<page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
    <id>2</id>
    <revision>
      <id>11248</id>
      <timestamp>2003-12-31T13:47:54Z</timestamp>
      <contributor>
        <username>Frieda</username>
        <id>0</id>
      </contributor>
      <minor />
      <text xml:space="preserve">categoria</text>
      <sha1>0acykl71lto9v65yve23lmjgia1h6sz</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
</mediawiki>

This is my code:

from lxml import etree

def fast_iter(context, func):
    # fast_iter is useful if you need to free memory while iterating through a
    # very large XML file.
    #
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    if elem.ns.text == '0':
        print elem.title.text

context=etree.iterparse('test.xml', events=('end',), tag='page')
fast_iter(context, process_element)

I don't get any error; there is simply no output. What I want is to print the title of each page element whose ns is 0.

Recommended answer

You are parsing a namespaced document, so no element has the plain tag 'page'; that name only matches tags without a namespace.

You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
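To see why the plain name fails, it can help to inspect how lxml reports tag names. The following is a minimal sketch, assuming a cut-down in-memory document rather than the full test.xml; the variable names are illustrative only.

```python
from lxml import etree

NS = "http://www.mediawiki.org/xml/export-0.8/"

# Cut-down stand-in for test.xml (an assumption made for illustration).
xml = ('<mediawiki xmlns="%s"><page><ns>0</ns></page></mediawiki>' % NS).encode("utf-8")

root = etree.fromstring(xml)
page = root[0]

# lxml stores the namespace URI in braces in front of the local name.
print(page.tag)

# QName splits a qualified tag into its two parts.
print(etree.QName(page).localname)
print(etree.QName(page).namespace)

# The bare name matches nothing; the fully qualified name does.
print(root.find("page") is None)
print(root.find("{%s}page" % NS) is not None)
```

The default xmlns declaration on the root element applies to every descendant, which is why page, ns and title all carry the same namespace URI.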

Many lxml methods let you pass a namespace map to make matching easier, but unfortunately iterparse() is not one of them.
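As an illustration of such a namespace map, here is a small sketch using findall() and xpath() on an already parsed tree. The "mw" prefix is an arbitrary choice for this example, and the document is a cut-down stand-in for test.xml.

```python
from lxml import etree

# Map an arbitrary prefix ("mw", chosen for this example) to the namespace URI.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.8/"}

doc = etree.fromstring(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b'<page><title>MediaWiki:Category</title><ns>0</ns></page>'
    b'</mediawiki>'
)

# findall() and xpath() both accept the map through the namespaces keyword.
pages = doc.findall("mw:page", namespaces=NS)
titles = doc.xpath("mw:page[mw:ns = '0']/mw:title/text()", namespaces=NS)

print(len(pages))
print(titles)
```

This approach loads the whole tree into memory, so it suits small files; for large dumps the streaming iterparse() call is still needed.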

The following .iterparse() call does process the right page elements:

context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')

However, you then need to use .find() to get the ns and title elements of each page element, or use xpath() calls to get the text directly:

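def process_element(elem):
    # Match child elements by local name, ignoring their namespace.
    if elem.xpath("./*[local-name()='ns']/text()=0"):
        # print() with a single argument works on both Python 2 and 3.
        print(elem.xpath("./*[local-name()='title']/text()")[0])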

For your input example, this prints:

>>> fast_iter(context, process_element)
MediaWiki:Category
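The .find() route mentioned above could look roughly like the sketch below. It is not part of the original answer: the helper name main_namespace_titles, the "mw" prefix, and the second sample page (Talk:Example) are all assumptions made for illustration, and the input is read from an in-memory buffer instead of test.xml so the example is self-contained.

```python
from io import BytesIO
from lxml import etree

MW_NS = "http://www.mediawiki.org/xml/export-0.8/"
NSMAP = {"mw": MW_NS}  # "mw" is an arbitrary prefix chosen for this sketch

# Stand-in for test.xml; the second page is hypothetical sample data.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>MediaWiki:Category</title><ns>0</ns><id>2</id></page>
  <page><title>Talk:Example</title><ns>1</ns><id>3</id></page>
</mediawiki>"""

def main_namespace_titles(source):
    """Collect the titles of all pages whose ns element contains '0'."""
    titles = []
    # iterparse() needs the fully qualified tag name.
    context = etree.iterparse(source, events=("end",), tag="{%s}page" % MW_NS)
    for _event, page in context:
        # findtext() accepts a prefix map, unlike iterparse().
        if page.findtext("mw:ns", namespaces=NSMAP) == "0":
            titles.append(page.findtext("mw:title", namespaces=NSMAP))
        # Free memory as in fast_iter: empty the element, then drop
        # the already processed siblings that precede it.
        page.clear()
        while page.getprevious() is not None:
            del page.getparent()[0]
    return titles

titles = main_namespace_titles(BytesIO(SAMPLE))
print(titles)
```

Comparing the text against the string "0" avoids XPath's implicit number conversion. For a real dump, a file name such as 'test.xml' can be passed in place of the BytesIO object.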
