How to use lxml to grab specific parts of an XML document?
Question
I am using Amazon's API to receive information about books. I am trying to use lxml to extract the specific parts of the XML document that my application needs, but I am not really sure how to use lxml. This is as far as I have gotten:
root = etree.XML(response)
This creates an etree object for the XML document.
Here is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Items", but I only pasted one of them to give you a specific example. For each item, I want to extract the title and ISBN. How do I do that with the etree object that I have?
<ItemSearchResponse><Items><Item><ItemAttributes><Title>I want this info</Title></ItemAttributes></Item></Items></ItemSearchResponse>
<ItemSearchResponse><Items><Item><ItemAttributes><ISBN>And I want this info</ISBN></ItemAttributes></Item></Items></ItemSearchResponse>
Basically, I do not know how to traverse the tree using my etree object, and I want to learn how.
Edit 1: I am trying the following code:
tree = etree.fromstring(response)
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    print(item)
    print(item.items()) # Apparently, there is nothing in item.items()
    for key, value in item.items():
        print(key)
        print(value)
But I get the following output: http://dpaste.com/287496/
I added the print(item.items()) call, and it just returns an empty list. Each item is an Element, but for some reason it has no items.
Edit 2: I can use the following code to get the information I want, but it seems like lxml must have an easier way... (this way doesn't seem very efficient):
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    title_text = ""
    author_text = ""
    isbn_text = ""
    for isbn in item.iterfind(".//"+AMAZON_NS+"ISBN"):
        isbn_text = isbn.text
    for title in item.iterfind(".//"+AMAZON_NS+"Title"):
        title_text = title.text
    for author in item.iterfind(".//"+AMAZON_NS+"Author"):
        author_text = author.text
    print(title_text + " by " + author_text + " has ISBN: " + isbn_text)
Recommended answer
Since you're getting the entire response as one large XML string, you can use lxml's 'fromstring' function to parse it into an Element, the root of the parsed tree. Then you can use the findall method (or, since you want to iterate over the results, iterfind), but there's a catch: Amazon's XML responses are namespaced, so you have to account for the namespace for lxml to find anything. Something like this ought to do the trick:
from lxml import etree

root = etree.fromstring(responseFromAmazon)
# this creates a constant with the namespace in the form that lxml can use it
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
# this searches the tree and iterates over results, taking the namespace into account
for eachitem in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    # note: items() returns the element's XML attributes, not its child elements,
    # so for this response the inner loop finds nothing (see the revised code below)
    for key, value in eachitem.items():
        if key == 'ISBN':
            pass  # Do your stuff
        if key == 'Title':
            pass  # Do your stuff
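A note on why items() comes back empty in the loop above: on an Element, items() returns the XML attributes written inside the opening tag (name="value" pairs), not the child elements. ItemAttributes carries no XML attributes in Amazon's response, so there is nothing to iterate over. The children are reached by iterating over the element itself. A minimal sketch, assuming a made-up, heavily trimmed response (the sample XML, its values, and the lang attribute are invented for illustration; the fallback import is only there so the snippet also runs where lxml is not installed):

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard-library module, which offers the
    # same calls used below.
    import xml.etree.ElementTree as etree

AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Hypothetical sample; a real response has many more fields.
sample = (
    '<ItemSearchResponse '
    'xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    '<Items><Item><ItemAttributes lang="en">'
    '<ISBN>0000000000</ISBN><Title>Example Title</Title>'
    '</ItemAttributes></Item></Items></ItemSearchResponse>'
)

root = etree.fromstring(sample)
attrs = root.find(".//" + AMAZON_NS + "ItemAttributes")

# items() yields the XML attributes of the tag itself (here the invented lang).
xml_attributes = list(attrs.items())

# Iterating over the element yields its child elements (ISBN, Title, ...).
children = {child.tag.replace(AMAZON_NS, ""): child.text for child in attrs}

print(xml_attributes)
print(children)
```

In other words, "attributes" in the lxml API means XML attributes, which is unrelated to the ItemAttributes element name that Amazon happens to use.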
Edit 1
See if this works any better:
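root = etree.fromstring(responseFromAmazon)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
for attrs in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    item = {}
    # collect every child element (Title, ISBN, ...), keyed by its tag name
    # with the namespace prefix stripped
    for child in attrs:
        item[child.tag.replace(AMAZON_NS, "")] = child.text
    # use item here, once per book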
Then, you can access item["Title"], item["ISBN"], etc. as needed.
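If only a few known fields are needed, another option is to look each one up by name with findtext, passing a prefix-to-URI mapping through the namespaces argument instead of concatenating the {...} string by hand. A sketch under stated assumptions: the namespace URI is the one from the answer above, while the sample XML, the aws prefix, and the extract_books helper are made up for illustration (the fallback import is again only for environments without lxml):

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard-library module supports the same calls.
    import xml.etree.ElementTree as etree

# The prefix "aws" is arbitrary; only the URI has to match the document.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}


def extract_books(xml_text):
    """Return one dict per Item with its Title and ISBN (None if absent)."""
    root = etree.fromstring(xml_text)
    books = []
    for item in root.iterfind(".//aws:Item", namespaces=NS):
        books.append({
            "Title": item.findtext("aws:ItemAttributes/aws:Title", namespaces=NS),
            "ISBN": item.findtext("aws:ItemAttributes/aws:ISBN", namespaces=NS),
        })
    return books


# Hypothetical sample with two items; the second has no ISBN element.
sample = (
    '<ItemSearchResponse '
    'xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    '<Items>'
    '<Item><ItemAttributes><ISBN>0000000000</ISBN>'
    '<Title>First Example</Title></ItemAttributes></Item>'
    '<Item><ItemAttributes>'
    '<Title>Second Example</Title></ItemAttributes></Item>'
    '</Items></ItemSearchResponse>'
)

books = extract_books(sample)
for book in books:
    print(book)
```

Looking fields up by name avoids depending on the order of the children, and a missing field shows up as None rather than raising an error.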