Parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)
Problem description
I send a GET request to the CareerBuilder API:
import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
I get back an XML document, but I have trouble parsing it.
Using either lxml:
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
or ElementTree:
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?
You are passing the parser the decoded unicode value (r.text). Use r.raw, the raw response data, instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
This reads the data from the response directly; note the stream=True option to .get().
Setting the r.raw.decode_content = True flag ensures that the raw stream gives you the decompressed content even if the response is gzip or deflate compressed.
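To see why the decompression step matters, here is a minimal offline sketch. It does not call requests at all: a gzip.GzipFile wrapped around an in-memory buffer stands in for r.raw with decode_content enabled, and XML_BYTES, parse_stream and the sample document are hypothetical names and data made up for illustration. It is written for Python 3 and uses only the standard library.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document (not the real CareerBuilder response).
# It contains a non-breaking space (U+00A0), like the one in the traceback.
XML_BYTES = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Results><Job>Bio\u00a0Tech</Job></Results>'
).encode("utf-8")

# Simulate a gzip-compressed response body as it would arrive on the wire.
compressed = gzip.compress(XML_BYTES)


def parse_stream(fileobj):
    """Parse XML from any binary file-like object and return the root element."""
    return ET.parse(fileobj).getroot()


# Feeding the still-compressed bytes to the parser fails: they are not XML.
try:
    ET.fromstring(compressed)
    compressed_parses = True
except ET.ParseError:
    compressed_parses = False

# A decompressing file-like object (the stand-in for r.raw here) parses fine.
decompressing_stream = gzip.GzipFile(fileobj=io.BytesIO(compressed))
root = parse_stream(decompressing_stream)
job_text = root.findtext("Job")
```

The parser only ever sees bytes from a file-like object, so it can read the encoding declaration itself; the caller's only job is to make sure those bytes are already decompressed.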
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers always expect bytes as input, because the XML format itself (through the encoding declaration) dictates how the parser is to decode those bytes to Unicode text.
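The point about the format dictating the decoding can be illustrated with a small sketch, assuming Python 3 and the standard-library ElementTree. The sample documents and the parse_title helper are hypothetical; they stand in for the bytes you would get from response.content.

```python
import xml.etree.ElementTree as ET


def parse_title(raw_bytes):
    """Parse an XML document given as bytes and return the <Title> text."""
    root = ET.fromstring(raw_bytes)
    return root.findtext("Title")


# The same logical document, serialized in two different encodings.
# Each one declares its own encoding in the XML declaration.
utf8_doc = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Job><Title>Senior\u00a0Biologist</Title></Job>'
).encode("utf-8")

latin1_doc = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?>'
    u'<Job><Title>Senior\u00a0Biologist</Title></Job>'
).encode("iso-8859-1")

# The byte sequences differ, but the parser reads each declaration and
# decodes accordingly, so both yield the same Unicode text.
title_from_utf8 = parse_title(utf8_doc)
title_from_latin1 = parse_title(latin1_doc)
```

Because the parser handles the decoding, passing it bytes such as response.content avoids decoding the body a second time in your own code.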