xml.etree.ElementTree.ParseError: not well-formed (invalid token)

Problem description

We are using Python 3.

The error we get:

  File "C:/scratch.py", line 27, in run
    tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
  File "C:\Programs\Python\Python36-32\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 163, column 1106

Our code:

import xml.etree.ElementTree as ET

# responses[0] holds the raw bytes of the first feed response
tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
for i in tree.iter('item'):
    try:
        title = i.find('title').text
    except Exception:
        pass

responses is the list of results returned by a series of URL GET requests. Index 0 here corresponds to the one URL we are testing: http://feeds.feedburner.com/marginalrevolution/feed

We pasted the XML into the W3 Schools validator and got:

This page contains the following errors:
error on line 163 at column 31: Input is not in proper UTF-8, indicate encoding! Bytes: 0x0C 0x66 0x69 0x67

But since we pass ET.XMLParser(encoding='utf-8'), shouldn't that fix the error when parsing?

Solution

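The error message from the W3 Schools validator is misleading. The problem with 0x0c is not that it is invalid UTF-8 (it is a valid single-byte sequence), but that it is not a legal character in XML: among the characters below 0x20, XML 1.0 permits only tab (0x09), line feed (0x0A) and carriage return (0x0D). The encoding argument only tells the parser how to decode the bytes; it does not relax the well-formedness checks.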
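
To make the distinction concrete, here is a minimal sketch. The one-element document in `raw` is a hypothetical stand-in for the feed, not its actual content. It shows that the byte decodes cleanly as UTF-8 while the parser still rejects it:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing a form feed (0x0C).
raw = b'<title>A\x0cfigure</title>'

# The bytes are valid UTF-8: decoding succeeds and yields U+000C.
text = raw.decode('utf-8')
print(repr(text))

# Parsing still fails, because U+000C is not an allowed XML 1.0 character.
try:
    ET.fromstring(raw)
    parsed = True
except ET.ParseError as exc:
    parsed = False
    print('ParseError:', exc)
```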

0x0c is the form feed control character, so its presence in the document serves no purpose. Conforming XML parsers are obliged to reject documents that are not well formed, and you cannot change the RSS feed, so the simplest solution is to remove the character from the document before processing.

>>> tree = ET.fromstring(original_response, ET.XMLParser(encoding='utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 185, column 1106

>>> fixed = original_response.replace(b'\x0c', b'')
>>> tree = ET.fromstring(fixed, ET.XMLParser(encoding='utf-8'))
>>> tree
<Element 'rss' at 0x7ff316db6278>
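
If the feed might contain other stray control characters as well, the same idea can be generalized. The following is a sketch under assumptions, not part of the original answer: the helper name `strip_illegal_xml_bytes` and the `sample` document are made up for illustration. It removes every byte below 0x20 that XML 1.0 disallows. Working on the raw bytes is safe for UTF-8 input because those byte values never occur inside a multi-byte sequence.

```python
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_illegal_xml_bytes(raw: bytes) -> bytes:
    """Remove control bytes that are not legal characters in XML 1.0."""
    return _ILLEGAL_XML_BYTES.sub(b'', raw)


# Hypothetical stand-in for the feed; the form feed makes it ill-formed.
sample = b'<rss><channel><item><title>A\x0cfigure</title></item></channel></rss>'

root = ET.fromstring(strip_illegal_xml_bytes(sample))
titles = [item.findtext('title') for item in root.iter('item')]
print(titles)
```

Passing bytes straight to ET.fromstring, as the answer does, also lets the parser handle the decoding itself, so the .decode() call from the question is not needed.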
