Why does ElementTree reject UTF-16 XML declarations with "encoding incorrect"?

Problem description

In Python 2.7, when I pass ElementTree's fromstring() a unicode string that has encoding="UTF-16" in its XML declaration, I get a ParseError saying that the specified encoding is incorrect:

>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30

What does that mean? What makes ElementTree think so?

After all, I'm passing in unicode code points, not a byte string. There is no encoding involved here. How can it be incorrect?

Of course, one could argue that any encoding is incorrect, since these unicode code points are not encoded at all. But then why is UTF-8 not rejected as an "incorrect encoding"?

>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')

I can work around this easily, either by encoding the unicode string into a UTF-16-encoded byte string and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why the exception is raised. The ElementTree documentation says nothing about accepting only byte strings.

Specifically, I would like to avoid these additional operations because my input data can get quite big. I don't want to hold it in memory twice, and I don't want to spend more CPU time processing it than is absolutely necessary.

Recommended answer

I'm not going to try to justify the behavior, only to explain why it happens with the code as written.

In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You MUST call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:

ElementTree.fromstring(data.encode('utf-16-be'))
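
As a minimal sketch of that fix (reusing the data string from the question, and assuming nothing beyond the standard library), either byte order should parse, because the parser then receives two bytes per character, which agrees with the declaration:

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Encode explicitly so the parser sees UTF-16 bytes that match the
# encoding named in the XML declaration.
for codec in ('utf-16-le', 'utf-16-be'):
    root = ElementTree.fromstring(data.encode(codec))
    print(root.tag)
```

The loop is only there to show that both byte orders are accepted; in real code one call with either codec is enough.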

Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:

static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;

    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}

So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say:


s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)] This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
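
To make the implicit step visible, here is a rough sketch that performs the conversion by hand. It assumes the default encoding is ASCII (the usual Python 2 default), so the explicit .encode('ascii') call stands in for what s# does behind the scenes; it is an illustration, not the actual pyexpat code path:

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Assumption: the default encoding is ASCII, so the implicit conversion
# is roughly equivalent to this explicit call. The result is one byte
# per character, which contradicts the declared UTF-16 encoding.
as_ascii = data.encode('ascii')

error = None
try:
    ElementTree.fromstring(as_ascii)
except ElementTree.ParseError as exc:
    error = exc

print(error)
```

The parser is handed single-byte data whose declaration claims UTF-16, and that mismatch is what it reports as an incorrect encoding.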

Check it out:

from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)

This gives the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)

This means that when you specified encoding="utf-8", you were just getting lucky: your input contained no non-ASCII characters, so encoding the Unicode string to ASCII happened to succeed and produced bytes that are also valid UTF-8. If you add the following before you parse, UTF-8 works as expected with that example:

import sys
reload(sys).setdefaultencoding('utf8')

However, setting the default encoding to 'utf-16-be' or 'utf-16-le' does not work, because the Python parts of ElementTree do direct string comparisons, which start to fail in UTF-16 land.
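
Regarding the memory concern in the question, one possible approach (a sketch, not something the ElementTree docs prescribe for this case) is to encode and feed the string in slices, so that only one small encoded chunk exists at a time. The helper name parse_in_chunks and its chunk_chars parameter are made up for this illustration:

```python
from xml.etree import ElementTree


def parse_in_chunks(text, chunk_chars=65536, codec='utf-16-le'):
    """Hypothetical helper: encode and feed `text` slice by slice."""
    parser = ElementTree.XMLParser()
    for start in range(0, len(text), chunk_chars):
        # Use a BOM-less codec ('utf-16-le' or 'utf-16-be'); plain
        # 'utf-16' would prepend a BOM to every chunk.
        parser.feed(text[start:start + chunk_chars].encode(codec))
    return parser.close()


data = u'<?xml version="1.0" encoding="utf-16"?><root>\u2163</root>'
root = parse_in_chunks(data, chunk_chars=7)  # tiny chunks, just for the demo
print(root.tag)
```

This still pays the CPU cost of encoding every character once, but it avoids holding a complete second copy of the input in memory.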
