How to use lxml to grab specific parts of an XML document?

Question

I am using Amazon's API to receive information about books. I am trying to use lxml to extract the specific parts of the XML document that my application needs, but I am not really sure how to use lxml. This is as far as I have gotten:

root = etree.XML(response)

This creates an etree object for the XML document.

Here is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Items", but I only pasted one of them to give you a specific example. For each item, I want to extract the title and ISBN. How do I do that with the etree object that I have?

<ItemSearchResponse><Items><Item><ItemAttributes><Title>I want this info</Title></ItemAttributes></Item></Items></ItemSearchResponse>

<ItemSearchResponse><Items><Item><ItemAttributes><ISBN>And I want this info</ISBN></ItemAttributes></Item></Items></ItemSearchResponse>

Basically, I do not know how to traverse the tree using my etree object, and I want to learn how.

Edit 1: I am trying the following code:

tree = etree.fromstring(response)
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    print(item)
    print(item.items()) # Apparently, there is nothing in item.items()
    for key, value in item.items():
        print(key)
        print(value)

But I get the following output: http://dpaste.com/287496/

I added the print(item.items()) call, and it just prints an empty list. Each item is an Element, but for some reason it has no items.

Edit 2: I can use the following code to get the information I want, but it seems like lxml must have an easier way (this way doesn't seem very efficient):

for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    title_text = ""
    author_text = ""
    isbn_text = ""
    for isbn in item.iterfind(".//"+AMAZON_NS+"ISBN"):
        isbn_text = isbn.text
    for title in item.iterfind(".//"+AMAZON_NS+"Title"):
        title_text = title.text
    for author in item.iterfind(".//"+AMAZON_NS+"Author"):
        author_text = author.text
    print(title_text + " by " + author_text + " has ISBN: " + isbn_text)

Recommended answer

Since you're getting the entire response as one large XML string, you can use lxml's 'fromstring' function to parse it into an Element (the root of the tree). Then you can use the findall function (or, since you want to iterate over the results, the iterfind function). There's a catch, though: Amazon's XML responses are namespaced, so you have to account for the namespace for lxml to search the tree properly. Something like this ought to do the trick:

from lxml import etree

root = etree.fromstring(responseFromAmazon)

# this creates a constant with the namespace in the form that lxml can use it
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# this searches the tree and iterates over results, taking the namespace into account
for eachitem in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    # items() only returns XML attributes, so iterate over the child elements instead
    for child in eachitem:
        if child.tag == AMAZON_NS + "ISBN":
            pass  # Do your stuff with child.text
        elif child.tag == AMAZON_NS + "Title":
            pass  # Do your stuff with child.text
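
For reference, the empty list reported in the question's Edit 1 is expected lxml behavior: Element.items() returns an element's XML attributes, while child elements such as Title and ISBN are reached by iterating over the element itself. A minimal sketch, using a made-up, non-namespaced sample document (the sample string, the lang attribute, and all values are assumptions for illustration, not Amazon's actual response):

```python
from lxml import etree

# Hypothetical sample data, not a real Amazon response.
sample = (
    '<Items><Item><ItemAttributes lang="en">'
    "<Title>Example Title</Title><ISBN>0000000000</ISBN>"
    "</ItemAttributes></Item></Items>"
)
root = etree.fromstring(sample)
attrs = root.find(".//ItemAttributes")

# items() lists the XML attributes written on the tag itself.
attribute_pairs = attrs.items()

# Iterating over an element yields its child elements.
child_pairs = [(child.tag, child.text) for child in attrs]

print(attribute_pairs)
print(child_pairs)
```

In Amazon's response the ItemAttributes tag carries no XML attributes, which would explain why items() came back empty there.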

Edit 1

See if this works better:

root = etree.fromstring(responseFromAmazon)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
for attrs in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    item = {}
    # copy every child element into the dict, keyed by its tag without the namespace
    for child in attrs:
        item[child.tag.replace(AMAZON_NS, "")] = child.text

Then, you can access item["Title"], item["ISBN"], etc. as needed.
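
On the question's request for an easier way: lxml's find, findall, iterfind, and findtext accept a namespaces mapping, so each field can be looked up directly with a prefixed path. A minimal sketch, assuming the namespace URI quoted in the answer above; the "aws" prefix, the extract_books helper, and the sample document are hypothetical names and data introduced only for illustration:

```python
from lxml import etree

# Namespace URI taken from the answer; the "aws" prefix is an arbitrary local alias.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}


def extract_books(xml_text):
    """Return one dict per ItemAttributes block with its Title and ISBN (None if missing)."""
    root = etree.fromstring(xml_text)
    books = []
    for attrs in root.iterfind(".//aws:ItemAttributes", namespaces=NS):
        books.append({
            "Title": attrs.findtext("aws:Title", namespaces=NS),
            "ISBN": attrs.findtext("aws:ISBN", namespaces=NS),
        })
    return books


# Hypothetical sample data shaped like the structure described in the question.
sample = (
    '<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    "<Items>"
    "<Item><ItemAttributes><Title>First Book</Title><ISBN>1111111111</ISBN></ItemAttributes></Item>"
    "<Item><ItemAttributes><Title>Second Book</Title></ItemAttributes></Item>"
    "</Items></ItemSearchResponse>"
)

books = extract_books(sample)
for book in books:
    print(book["Title"], book["ISBN"])
```

Keeping one dict per item avoids overwriting earlier results when the response contains several Items. If the response is a str that starts with an XML encoding declaration, lxml expects bytes instead, so it may need to be encoded before parsing.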
