Parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

Views: 1014 Published: 2017/8/16 20:01:30 Tags: python encoding lxml elementtree python-requests


Problem description

I send a GET request to the CareerBuilder API:

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

The response is an XML document, but I have trouble parsing it.

Using lxml:

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?

Solution

You are passing the decoded unicode value (r.text) to the parser. Use the raw response data from r.raw instead:

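# stream=True leaves the body unread so r.raw can be parsed directly
r = requests.get(url, params=payload, stream=True)
# ask the raw response to decompress gzip/deflate content on read
r.raw.decode_content = True
tree = etree.parse(r.raw)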

This reads the data directly from the response; note the stream=True option passed to .get().

Setting the r.raw.decode_content = True flag ensures that the raw response object gives you the decompressed content even if the response is gzip or deflate compressed.
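
To see why decompression matters, here is a minimal standard-library sketch that does not touch the network. It is only an analogy, not how requests works internally: a gzip-compressed byte string stands in for a compressed response body, and gzip.GzipFile plays the role that decode_content = True plays for r.raw. The sample document and all variable names are made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical payload standing in for a gzip-compressed HTTP response body.
document = b'<?xml version="1.0" encoding="UTF-8"?><Results><Count>3</Count></Results>'
compressed = gzip.compress(document)

# Parsing the compressed bytes directly fails: they are not XML.
try:
    ET.parse(io.BytesIO(compressed))
    parsed_compressed = True
except ET.ParseError:
    parsed_compressed = False

# Wrapping the stream in a decompressing file object yields parseable bytes.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
    tree = ET.parse(stream)

count = tree.getroot().findtext("Count")
```

Without a decompression step the parser sees compressed bytes rather than markup, which is the situation the flag is meant to avoid.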

You don't have to stream the response; for smaller XML documents it is fine to use the r.content attribute, which holds the undecoded response body as bytes:

r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

XML parsers expect bytes as input, because the XML document itself (through its encoding declaration) dictates how the parser is to decode those bytes to Unicode text.
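
As a small illustration of that point, the sketch below uses the standard library's xml.etree.ElementTree and two made-up documents (hypothetical stand-ins for r.content) that carry the same text, including the non-breaking space u'\xa0' from the traceback, in two different byte encodings. Each document declares the encoding its bytes actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; each declares the encoding its bytes actually use.
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Title>Biologist\u00a0II</Title>'
).encode("utf-8")
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<Title>Biologist\u00a0II</Title>'
).encode("iso-8859-1")

# The byte sequences differ, yet the parser decodes both to the same text
# because it reads the encoding from each XML declaration.
utf8_title = ET.fromstring(utf8_doc).text
latin1_title = ET.fromstring(latin1_doc).text
```

Handing the parser bytes lets it apply the declared encoding itself, so the caller never has to guess or decode first.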
