Parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)
I send a GET request to the CareerBuilder API :
import requests
url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
The response is an XML document, but I have trouble parsing it.
With lxml:
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
print etree.fromstring(xml)
File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
Or with ElementTree:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print ET.fromstring(xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?
You are passing the decoded unicode value (r.text) to the parser. Use r.raw, the raw response data, instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
This reads the data directly from the response; note the stream=True option passed to .get().
Setting r.raw.decode_content = True ensures that the raw stream gives you the decompressed content even if the response is gzip or deflate compressed.
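To see why handing the parser a file-like object of bytes works, here is a minimal offline sketch. It assumes the standard-library ElementTree rather than lxml, uses io.BytesIO as a stand-in for r.raw, and the element names (ResponseJobSearch, TotalCount) are hypothetical, not taken from the actual CareerBuilder response.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for r.raw: a binary file-like object that yields undecoded bytes.
# The element names below are made up for illustration.
raw = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<ResponseJobSearch><TotalCount>2</TotalCount></ResponseJobSearch>'
)

# parse() reads bytes from the file-like object and lets the XML
# encoding declaration decide how they are decoded.
tree = ET.parse(raw)
root = tree.getroot()
total = root.findtext("TotalCount")
```

With a real request, r.raw would take the place of the BytesIO object, provided the request was made with stream=True.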
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers expect bytes as input, because the XML document itself (through its encoding declaration) dictates how the parser is to decode those bytes to Unicode text.
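As a small illustration of that point, the sketch below feeds UTF-8 bytes containing a non-breaking space (the u'\xa0' character from the traceback) to the standard-library ElementTree. The byte string is a hypothetical stand-in for r.content, and the Job and Title element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.content: undecoded UTF-8 bytes with an
# encoding declaration and a non-breaking space (U+00A0) in the text.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Job><Title>Senior\u00a0Biologist</Title></Job>'
).encode("utf-8")

# The parser reads the declaration and decodes the bytes itself,
# so the non-ASCII character arrives intact as Unicode text.
root = ET.fromstring(xml_bytes)
title = root.find("Title").text
```

The character is not disallowed; the errors in the question come from handing the parser text that had already been decoded.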