Parsing an XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)


Question

I send a GET request to the CareerBuilder API:

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

I get back an XML document, but I have trouble parsing it.

Using either lxml:

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?

Solution

You are passing the decoded Unicode value (r.text) to the parser. Use the raw response data from r.raw instead:

r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)

which will read the data from the response directly; do note the stream=True option to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
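
The same idea can be tried offline. The following is only a sketch, assuming Python 3 and the standard-library xml.etree.ElementTree in place of lxml. The gzip.GzipFile object is a stand-in for r.raw with decode_content set, and the sample document is made up for illustration; it is not CareerBuilder data.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; U+00A0 (no-break space) is the character
# named in the ElementTree traceback from the question.
document = '<?xml version="1.0" encoding="UTF-8"?><job>Biologist\u00a0II</job>'
body = document.encode("utf-8")

# Simulate a gzip-compressed response body as it would travel over the wire.
compressed = gzip.compress(body)

# Stand-in for r.raw with decode_content = True: a file-like object that
# yields the decompressed bytes when read.
raw = gzip.GzipFile(fileobj=io.BytesIO(compressed))

# parse() reads bytes from the file object and decodes them according to
# the encoding declared in the XML prolog.
tree = ET.parse(raw)
title = tree.getroot().text
print(repr(title))
```

The parser never sees a Python text string here, only bytes, so it is free to honor the encoding declaration itself.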

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:

r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

XML parsers expect bytes as input, because the XML format itself dictates how the parser is to decode those bytes to Unicode text.
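
As a small illustration of that last point, here is a sketch assuming Python 3 and the standard-library ElementTree; both payloads are invented for the example. The same text is serialized under two different declared encodings, and the parser recovers the same string from each by following the declaration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical payloads carrying the same text, "Biologist<NBSP>II".
# In UTF-8 the no-break space is the two bytes C2 A0; in ISO-8859-1 it is
# the single byte A0.
utf8_bytes = b'<?xml version="1.0" encoding="UTF-8"?><job>Biologist\xc2\xa0II</job>'
latin1_bytes = b'<?xml version="1.0" encoding="ISO-8859-1"?><job>Biologist\xa0II</job>'

# The parser reads the declaration in each prolog and decodes accordingly.
utf8_text = ET.fromstring(utf8_bytes).text
latin1_text = ET.fromstring(latin1_bytes).text

print(utf8_text == latin1_text)
```

Both calls should yield the same Unicode string even though the input bytes differ, which is why handing the parser bytes (r.content or r.raw) rather than already-decoded text is the safer default.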
