Parsing an XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

Views: 17 | Published: 2021/12/31 19:49:51 | Tags: python xml python-requests lxml elementtree

Problem description

I send a GET request to the CareerBuilder API:

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

I get back an XML document that looks like this. However, I have trouble parsing it.



Using either lxml:

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?

Solution

You are passing the decoded Unicode value (r.text) to the parser. Use the raw response data, r.raw, instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
This reads the data from the response directly; note the stream=True option to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
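
To see why the decompression step matters, here is a minimal offline sketch, assuming Python 3 and only the standard library. The io.BytesIO object is a hypothetical stand-in for r.raw, the gzip module stands in for the decoding that decode_content = True asks for, and the sample XML body is made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Made-up response body, compressed the way a gzip-encoded response would be.
body = b'<?xml version="1.0" encoding="UTF-8"?><Results><Count>1</Count></Results>'
compressed = gzip.compress(body)

# Feeding the still-compressed bytes to the parser fails: they are not XML.
try:
    ET.parse(io.BytesIO(compressed))
    parsed_compressed = True
except ET.ParseError:
    parsed_compressed = False

# Decompressing on the fly (roughly what decode_content = True requests)
# gives the parser a file-like object that yields plain XML bytes.
tree = ET.parse(gzip.GzipFile(fileobj=io.BytesIO(compressed)))
count = tree.getroot().find("Count").text
```

In both cases the parser is handed a file-like object, which is what lets etree.parse(r.raw) read straight from the response.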

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers expect bytes as input, because the XML format itself (through the encoding declaration) dictates how the parser is to decode those bytes to Unicode text.
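
As a rough illustration of the bytes-in approach, the sketch below assumes Python 3 and uses the standard-library ElementTree rather than lxml. The xml_bytes value is a hypothetical stand-in for r.content, and the element names are invented; the non-breaking space (U+00A0) is the character named in the traceback above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.content: UTF-8 encoded bytes that carry an
# encoding declaration and a non-breaking space (U+00A0).
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<Results><JobTitle>Marine\u00a0Biologist</JobTitle></Results>"
)
xml_bytes = xml_text.encode("utf-8")

# The parser reads the declaration and decodes the bytes itself.
root = ET.fromstring(xml_bytes)
title = root.find("JobTitle").text
```

The parser returns the element text as Unicode, with the non-breaking space intact, without any manual decoding step.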
