Parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

Problem description

I send a GET request to the CareerBuilder API:

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

I get back an XML document, but I have trouble parsing it.

Using either lxml

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?

Solution

You are passing the decoded Unicode value (r.text) to the parser. Use the raw response data, r.raw, instead:

r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)

which will read the data from the response directly; do note the stream=True option to .get().

Setting r.raw.decode_content = True ensures that the raw stream gives you the decompressed content even if the response is gzip- or deflate-compressed.
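
To see why the decompression step matters, here is a minimal sketch that runs without network access (Python 3, standard library only). The payload and the Job/Title element names are invented for illustration, and gzip.GzipFile merely stands in for the decompression that decode_content = True requests from r.raw; it is not how requests implements it internally.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical response body: a small UTF-8 XML document, gzip-compressed
# the way a server might send it with Content-Encoding: gzip.
document = '<?xml version="1.0" encoding="UTF-8"?><Job><Title>Biologist\xa0II</Title></Job>'
compressed = gzip.compress(document.encode("utf-8"))

# Feeding the compressed bytes straight to the parser fails,
# because they are not XML at all.
try:
    ET.parse(io.BytesIO(compressed))
    parsed_compressed = True
except ET.ParseError:
    parsed_compressed = False

# Wrapping the stream in a decompressor yields a file-like object that
# returns plain XML bytes, which the parser can read directly.
stream = gzip.GzipFile(fileobj=io.BytesIO(compressed))
tree = ET.parse(stream)
title = tree.getroot().find("Title").text
```

The parser only ever sees bytes from a file-like object, which is the same role r.raw plays in the snippet above.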

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:

r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

XML parsers always expect bytes as input, because the XML format itself dictates how the parser is to decode those bytes to Unicode text.
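
As a small illustration of that point, the sketch below (Python 3, standard library only) parses two byte strings that hold the same text in different encodings. Each one declares its own encoding in the XML prolog, and the parser decodes accordingly. The payloads and element names are made up and stand in for r.content.

```python
import xml.etree.ElementTree as ET

# Hypothetical response bodies: the same document in two encodings,
# each declaring its own encoding in the XML prolog.
utf8_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Job><Title>Caf\xe9 Biologist</Title></Job>'
).encode("utf-8")
latin1_body = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<Job><Title>Caf\xe9 Biologist</Title></Job>'
).encode("iso-8859-1")

# The parser receives bytes and uses the declaration to decode them,
# so both calls produce the same Unicode text.
utf8_title = ET.fromstring(utf8_body).find("Title").text
latin1_title = ET.fromstring(latin1_body).find("Title").text
```

The two byte strings differ on the wire, yet both titles come out as the same Unicode string, which is why decoding the body yourself before parsing is unnecessary.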
