Parsing blank XML tags with LXML and Python

Problem description

When parsing XML documents in the following format:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model>Camaro</Model>
</Car>

I use the following code:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Color'])  # Blue

If a tag is empty, for example:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
</Car>

Using the same code as above:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Model'])  # KeyError

How would I parse this blank tag?

Recommended answer

You're putting in a [text()] filter, which explicitly asks only for elements that have text nodes in them... and then you're unhappy when it doesn't give you elements without text nodes?

Leave that filter out, and you'll get your Model element:

>>> import lxml.etree
>>> s='''
... <root>
...   <Car>
...     <Color>Blue</Color>
...     <Make>Chevy</Make>
...     <Model/>
...   </Car>
... </root>'''
>>> e = lxml.etree.fromstring(s)
>>> # '*' selects child elements only; node() would also return the whitespace text nodes
>>> carData = e.xpath('Car/*')
>>> carData
[<Element Color at 0x23a5460>, <Element Make at 0x23a54b0>, <Element Model at 0x23a5500>]
>>> {child.tag: child.text for child in carData}
{'Color': 'Blue', 'Make': 'Chevy', 'Model': None}
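
If None is awkward downstream, one option is to substitute a default while building the dictionary. The following is only a sketch: the helper name car_to_dict and its default parameter are made up for illustration, and the import falls back to the standard library's xml.etree.ElementTree (which shares this part of the API) in case lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

XML = """
<root>
  <Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
  </Car>
</root>
"""


def car_to_dict(car, default=None):
    """Map each child element's tag to its text (hypothetical helper).

    Empty tags such as <Model/> have text None; they get `default` instead.
    """
    result = {}
    # Iterating an element yields its child nodes but never text nodes,
    # so the whitespace between tags is skipped automatically.
    for child in car:
        if not isinstance(child.tag, str):
            continue  # skip comments / processing instructions (lxml)
        result[child.tag] = child.text if child.text is not None else default
    return result


root = etree.fromstring(XML)
cars = [car_to_dict(car, default="") for car in root.iter("Car")]
print(cars[0]["Model"] == "")
```

Whether an empty string, None, or some other sentinel is the right default depends on what consumes the data; None keeps "tag present but empty" distinguishable from a tag whose text really is an empty-looking value.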

That said, if your immediate goal is to iterate over the nodes in the tree, you might consider using lxml.etree.iterparse() instead. It hands you elements as they are parsed, so you can process each one and discard it rather than keeping a full tree in memory and then walking it with XPath. (Think SAX, but without the insane and painful API.)

An implementation with iterparse could look like this:

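import io
import lxml.etree

def get_cars(infile):
    """Yield one dict per <Car>, mapping child tag to text (None if empty)."""
    in_car = False
    current_car = {}
    for (event, element) in lxml.etree.iterparse(infile, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'Car':
                in_car = True
                current_car = {}
            continue
        if not in_car:
            continue
        if element.tag == 'Car':
            # End of a <Car>: emit it, then free the subtree we no longer need.
            in_car = False
            yield current_car
            element.clear()
            continue
        current_car[element.tag] = element.text

# iterparse reads bytes, so the in-memory example uses BytesIO
for car in get_cars(infile=io.BytesIO(b'''<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>''')):
    print(car)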

It's more code, but (if we weren't using an in-memory buffer for the example) it could process a file much larger than would fit in memory.
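
A shorter variant of the same idea listens only for 'end' events: by the time a closing </Car> is reported, all of its children have already been parsed, so the dictionary can be built in one step and the element cleared afterwards. This is a sketch under assumptions, not the answer's own code: iter_cars and its car_tag parameter are hypothetical names, the second car in the sample data is invented, and the import falls back to the standard library if lxml is unavailable.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree


def iter_cars(source, car_tag="Car"):
    """Yield a {tag: text} dict per car element; empty tags map to None."""
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.tag != car_tag:
            continue
        # At the 'end' event of a car, its children are complete.
        yield {child.tag: child.text for child in element}
        # Drop the children and text so finished cars do not pile up in memory.
        element.clear()


# Made-up sample data; a real caller would pass an open file or a filename.
data = io.BytesIO(
    b"<root>"
    b"<Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car>"
    b"<Car><Color>Red</Color><Make/><Model>Mustang</Model></Car>"
    b"</root>"
)
cars = list(iter_cars(data))
for car in cars:
    print(car)
```

Clearing each element keeps memory use roughly flat, although the emptied Car elements themselves stay attached to the root; for very large inputs it may be worth removing them from the parent as well.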
