Handling nested elements with Python lxml

Question

Given the simple XML data below:

<book>
    <title>My First Book</title>
    <abstract>
        <para>First paragraph of the abstract</para>
        <para>Second paragraph of the abstract</para>
    </abstract>
    <keywordSet>
        <keyword>First keyword</keyword>
        <keyword>Second keyword</keyword>
        <keyword>Third keyword</keyword>
    </keywordSet>
</book>

How can I traverse the tree, using lxml, and get all paragraphs in the "abstract" element, as well as all keywords in the "keywordSet" element?

The code snippet below returns only the first line of text in each element:

from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for line in root.abstract:
    print(line.para)  # prints only the first paragraph
for word in root.keywordSet:
    print(word.keyword)  # prints only the first keyword in the set

I tried to follow this example, but the code above doesn't work as expected.

On a different tack, it would be even better to read the entire XML tree into a Python dictionary, with each element name as a key and the element's text as its value(s). I found out that something like this might be possible using lxml objectify, but I couldn't figure out how to achieve it.

One really big problem I have been finding when attempting to write XML parsing code in Python is that most of the "examples" provided are just too simple and entirely fictitious to be of much help -- or else they are just the opposite, using too complicated automatically-generated XML data!

Can anyone give me a hint?

Thanks in advance!

After posting this question, I found a simple solution here.

So, my updated code becomes:

from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for para in root.abstract.iterchildren():
    print(para)  # now prints the text of all paragraphs
for keyword in root.keywordSet.iterchildren():
    print(keyword)  # now prints all keywords in the set
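The dictionary idea mentioned earlier can be built on the same iterchildren() loop. The sketch below is only one possible approach: the helper name children_to_dict is made up for illustration, and it assumes the tree is at most two levels deep below the root, as in the sample data. Deeper or mixed-content documents would need a recursive version.

```python
from lxml import objectify

# Sample data copied from the question.
xml_string = """
<book>
    <title>My First Book</title>
    <abstract>
        <para>First paragraph of the abstract</para>
        <para>Second paragraph of the abstract</para>
    </abstract>
    <keywordSet>
        <keyword>First keyword</keyword>
        <keyword>Second keyword</keyword>
        <keyword>Third keyword</keyword>
    </keywordSet>
</book>
"""


def children_to_dict(parent):
    """Hypothetical helper: map each child tag to its text,
    or to a list of texts when the child has children of its own."""
    result = {}
    for child in parent.iterchildren():
        grandchildren = list(child.iterchildren())
        if grandchildren:
            # Container element such as <abstract>: collect the leaf texts.
            result[child.tag] = [g.text for g in grandchildren]
        else:
            # Leaf element such as <title>: store its text directly.
            result[child.tag] = child.text
    return result


root = objectify.fromstring(xml_string)
book = children_to_dict(root)
print(book["title"])
print(book["abstract"])
print(book["keywordSet"])
```

Note that this flattening drops the inner tag names (para, keyword); if they matter, a nested dictionary would be the safer shape.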

Recommended answer

This is pretty simple using XPath:

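from lxml import etree

tree = etree.parse('data.xml')  # data.xml is assumed to contain the XML above

# An absolute path must start at the root element, which is <book> here.
paragraphs = tree.xpath('/book/abstract/para/text()')
keywords = tree.xpath('/book/keywordSet/keyword/text()')

print(paragraphs)
print(keywords)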

Output:

['First paragraph of the abstract', 'Second paragraph of the abstract']
['First keyword', 'Second keyword', 'Third keyword']

See the XPath Tutorial at W3Schools for details on the XPath syntax.

In particular, the expressions above use:

  • The / selector to select the root node / the immediate children.
  • The text() operator to select the text node (the "textual content") of the respective elements.
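To try the two constructs without a file on disk, here is a minimal self-contained sketch. It assumes the sample XML from the question is held in a string named xml_string and parsed with etree.fromstring; the relative expression at the end is an extra illustration, not something taken from the answer.

```python
from lxml import etree

# Sample data copied from the question.
xml_string = """
<book>
    <title>My First Book</title>
    <abstract>
        <para>First paragraph of the abstract</para>
        <para>Second paragraph of the abstract</para>
    </abstract>
    <keywordSet>
        <keyword>First keyword</keyword>
        <keyword>Second keyword</keyword>
        <keyword>Third keyword</keyword>
    </keywordSet>
</book>
"""

root = etree.fromstring(xml_string)  # root is the <book> element

# Absolute path: the leading / starts at the document root, and each
# following / steps down to an immediate child. text() picks the text nodes.
paragraphs = root.xpath('/book/abstract/para/text()')

# Relative path: evaluated from the element xpath() is called on (<book>).
keywords = root.xpath('keywordSet/keyword/text()')

# A single result is still returned as a list.
title = root.xpath('/book/title/text()')

print(paragraphs)
print(keywords)
print(title)
```

Each call returns a plain Python list of strings, so the results can be passed straight to ordinary list handling code.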

Here's how it could be done using the Objectify API:

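from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above

# root.abstract.para is a sequence of all <para> siblings, so it can be iterated.
paras = [p.text for p in root.abstract.para]
keywords = [k.text for k in root.keywordSet.keyword]

print(paras)
print(keywords)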

It seems that root.abstract.para is actually shorthand for root.abstract.para[0]. So you need to explicitly use element.iterchildren() to access all child elements.

That's not true; we obviously both misunderstood the Objectify API. In order to iterate over the paras in abstract, you need to iterate over root.abstract.para, not root.abstract itself. It's weird, because you intuitively think about abstract as a collection or a container for its nodes, and expect that container to be represented by a Python iterable. But it's actually the .para selector that represents the sequence.
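This behaviour is easy to check interactively. The short sketch below assumes the sample XML from the question is stored in xml_string; it relies on len(), indexing and iteration of an objectify element all operating on that element together with its same-named siblings.

```python
from lxml import objectify

# Sample data copied from the question.
xml_string = """
<book>
    <title>My First Book</title>
    <abstract>
        <para>First paragraph of the abstract</para>
        <para>Second paragraph of the abstract</para>
    </abstract>
    <keywordSet>
        <keyword>First keyword</keyword>
        <keyword>Second keyword</keyword>
        <keyword>Third keyword</keyword>
    </keywordSet>
</book>
"""

root = objectify.fromstring(xml_string)

# len() counts the element plus its siblings that share the same tag.
abstract_count = len(root.abstract)   # only one <abstract> under <book>
para_count = len(root.abstract.para)  # both <para> elements

# Indexing walks the same sibling sequence.
first = root.abstract.para[0].text
second = root.abstract.para[1].text

# Iterating root.abstract visits the single <abstract>, not its children,
# which is why the loop in the question printed only one paragraph.
tags_from_abstract = [el.tag for el in root.abstract]
tags_from_para = [el.tag for el in root.abstract.para]

print(abstract_count, para_count)
print(first, "|", second)
print(tags_from_abstract, tags_from_para)
```

Seen this way, root.abstract.para without an index simply behaves like the first item for attribute access while still standing for the whole sequence when iterated.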
