Why is python decode replacing more than the invalid bytes from an encoded string?


Problem Description


Trying to decode an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox, and Chrome.

The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':

>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data

UPDATE: This question concluded in a bug report against the Python unicode component. The issue is reported as fixed in Python 2.7.11 and 3.5.2.


What follows are the replacement policies used to handle decoding errors in Python, Firefox, and Chrome. Note how they differ, and especially how the Python builtin removes the valid S (plus the invalid sequence of bytes).

Python

The builtin replace error handler replaces the invalid \xe3\xab plus the S from SUFFIX with a single U+FFFD:

>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX�UFFIX

Browsers

To test how browsers decode the invalid sequence of bytes, we will use a CGI script:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

PREFIX\xe3\xabSUFFIX"""

The Firefox and Chrome browsers render:

PREFIX�SUFFIX


Why is the builtin replace error handler for str.decode removing the S from SUFFIX?

(Was UPDATE 1)

According to the Wikipedia article on UTF-8 (thanks mjv), the following ranges of bytes are used to indicate the start of a multi-byte sequence:

  • 0xC2-0xDF : Start of 2-byte sequence
  • 0xE0-0xEF : Start of 3-byte sequence
  • 0xF0-0xF4 : Start of 4-byte sequence
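These ranges can be captured in a tiny helper. This is a sketch for illustration only; the name expected_length is mine, not part of any API:

```python
def expected_length(lead):
    """Expected total length of a UTF-8 sequence, given its lead byte.

    Returns None for bytes that cannot start a sequence: stray
    continuation bytes 0x80-0xBF, the always-invalid 0xC0/0xC1,
    and 0xF5-0xFF.
    """
    if lead <= 0x7F:
        return 1  # ASCII, a sequence of one
    if 0xC2 <= lead <= 0xDF:
        return 2  # start of 2-byte sequence
    if 0xE0 <= lead <= 0xEF:
        return 3  # start of 3-byte sequence
    if 0xF0 <= lead <= 0xF4:
        return 4  # start of 4-byte sequence
    return None
```

For example, expected_length(0xE3) is 3, which is exactly why the decoder expects two continuation bytes after the 0xE3 in the test fragment.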

The 'PREFIX\xe3\xabSUFFIX' test fragment has 0xE3, which tells the Python decoder that a 3-byte sequence follows. The sequence is found to be invalid, and the Python decoder discards the whole sequence including '\xabS', then continues after it, ignoring any possibly correct sequence starting in the middle.

This means that for an invalidly encoded sequence like '\xF0SUFFIX', it will decode u'\ufffdFIX' instead of u'\ufffdSUFFIX'.
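For reference, on an interpreter that carries the fix mentioned in the UPDATE above (Python 2.7.11+ / 3.5.2+), the decoder no longer consumes valid bytes after an error. A quick Python 3 check (byte literals, since str is already unicode there):

```python
# On a fixed Python, only the maximal ill-formed subpart (here, the
# lone \xf0) is replaced, so the valid ASCII bytes of SUFFIX survive.
fixed = b'\xf0SUFFIX'.decode('utf-8', 'replace')
print(fixed)  # \ufffdSUFFIX, i.e. only one replacement character
```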

Example 1: Introducing DOM parsing bugs

>>> '<div>\xf0<div>Price: $20</div>...</div>'.decode('utf-8', 'replace')
u'<div>\ufffdv>Price: $20</div>...</div>'
>>> print _
<div>�v>Price: $20</div>...</div>

Example 2: Security issues (Also see Unicode security considerations):

>>> '\xf0<!-- <script>alert("hi!");</script> -->'.decode('utf-8', 'replace')
u'\ufffd- <script>alert("hi!");</script> -->'
>>> print _
�- <script>alert("hi!");</script> -->

Example 3: Losing valid information in a scraping application

>>> '\xf0' + u'it\u2019s'.encode('utf-8') # "it’s"
'\xf0it\xe2\x80\x99s'
>>> _.decode('utf-8', 'replace')
u'\ufffd\ufffd\ufffds'
>>> print _
���s

Using a cgi script to render this in browsers:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\xf0it\xe2\x80\x99s"""

Rendered:

�it’s
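For comparison, a Python with the fix applied (2.7.11+ / 3.5.2+) agrees with the browsers here and keeps the valid data. A Python 3 check:

```python
# The lone \xf0 cannot be completed by 'i' (0x69 is not a valid
# continuation byte), so only \xf0 becomes U+FFFD; "it's" survives.
recovered = b'\xf0it\xe2\x80\x99s'.decode('utf-8', 'replace')
print(recovered)  # \ufffd followed by it's (with a right single quote)
```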


Is there any official recommended way for handling decoding replacements?

(Was UPDATE 2)

In a public review, the Unicode Technical Committee has opted for option 2 of the following candidates:

  1. Replace the entire ill-formed subsequence by a single U+FFFD.
  2. Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD.
  3. Replace each code unit of the ill-formed subsequence by a single U+FFFD.

The UTC resolution is dated 2008-08-29; source: http://www.unicode.org/review/resolved-pri-100.html

UTC Public Review 121 also includes the invalid bytestream '\x61\xF1\x80\x80\xE1\x80\xC2\x62' as an example, showing the decoding result for each option.

            61      F1      80      80      E1      80      C2      62
      1   U+0061  U+FFFD                                          U+0062
      2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
      3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

In plain Python the three results are:

  1. u'a\ufffdb' shows as a�b
  2. u'a\ufffd\ufffd\ufffdb' shows as a���b
  3. u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb' shows as a������b

And here is what python does for the invalid example bytestream:

>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
u'a\ufffd\ufffd\ufffd'
>>> print _
a���

Again, using a cgi script to test how browsers render the buggy encoded bytes:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\x61\xF1\x80\x80\xE1\x80\xC2\x62"""

Both Chrome and Firefox render:

a���b

Note that the browsers' rendered result matches option 2 of the PR121 recommendation.

While option 3 looks easy to implement in Python, options 1 and 2 are a challenge.

>>> import codecs
>>> replace_option3 = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('replace_option3', replace_option3)
>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace_option3')
u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb'
>>> print _
a������b
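On a Python that carries the fix (2.7.11+ / 3.5.2+), the builtin 'replace' handler itself already produces the option 2 result, so no custom error handler is needed there. A Python 3 check:

```python
bad = b'\x61\xF1\x80\x80\xE1\x80\xC2\x62'
# One U+FFFD per maximal subpart: F1 80 80, then E1 80, then C2
# give three replacement marks between the 'a' and the 'b'.
result = bad.decode('utf-8', 'replace')
print(result)  # a followed by three U+FFFD, then b
```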

Solution

You know that your S is valid, with the benefit of both look-ahead and hindsight :-) Suppose there was originally a legal 3-byte UTF-8 sequence there, and the 3rd byte was corrupted in transmission ... with the change that you mention, you'd be complaining that a spurious S had not been replaced. There is no "right" way of doing it, without the benefit of error-correcting codes, or a crystal ball, or a tambourine.

Update

As @mjv remarked, the UTC issue is all about how many U+FFFD should be included.

In fact, Python is not using ANY of the UTC's 3 options.

Here is the UTC's sole example:

      61      F1      80      80      E1      80      C2      62
1   U+0061  U+FFFD                                          U+0062
2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

Here is what Python does:

>>> bad = '\x61\xf1\x80\x80\xe1\x80\xc2\x62cdef'
>>> bad.decode('utf8', 'replace')
u'a\ufffd\ufffd\ufffdcdef'
>>>

Why?

F1 should start a 4-byte sequence, but the E1 is not valid. One bad sequence, one replacement.
Start again at the next byte, the 3rd 80. Bang, another FFFD.
Start again at the C2, which introduces a 2-byte sequence, but C2 62 is invalid, so bang again.
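The restart rule described above can be modeled with a short sketch. This is a hypothetical helper that reproduces the old pre-2.7.11 behaviour for illustration (operating on Python 3 bytes); it is not how CPython itself was written:

```python
def buggy_replace_decode(data):
    """Model of the old behaviour: on an error, always advance by the
    full length promised by the lead byte, even when that swallows
    well-formed bytes that follow."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            n = 1
        elif 0xC2 <= lead <= 0xDF:
            n = 2
        elif 0xE0 <= lead <= 0xEF:
            n = 3
        elif 0xF0 <= lead <= 0xF4:
            n = 4
        else:
            # Stray continuation byte or invalid lead: one mark, move on.
            out.append('\ufffd')
            i += 1
            continue
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode('utf-8', 'strict'))
        except UnicodeDecodeError:
            # The bug: one U+FFFD, then skip all n promised bytes.
            out.append('\ufffd')
        i += n
    return ''.join(out)
```

This reproduces the outputs shown in the question: 'PREFIX\xe3\xabSUFFIX' loses its S, and the UTC example bytestream yields three U+FFFD with the trailing b consumed.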

It's interesting that the UTC didn't mention what Python is doing (restarting after the number of bytes indicated by the lead character). Perhaps this is actually forbidden or deprecated somewhere in the Unicode standard. More reading required. Watch this space.

Update 2 Houston, we have a problem.

=== Quoted from Chapter 3 of Unicode 5.2 ===

Constraints on Conversion Processes

The requirement not to interpret any ill-formed code unit subsequences in a string as characters (see conformance clause C10) has important consequences for conversion processes.

Such processes may, for example, interpret UTF-8 code unit sequences as Unicode character sequences. If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

If an implementation of a UTF-8 conversion process stops at the first error encountered, without reporting the end of any ill-formed UTF-8 code unit subsequence, then the requirement makes little practical difference. However, the requirement does introduce a significant constraint if the UTF-8 converter continues past the point of a detected error, perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable, ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because either of those outputs would be the result of misinterpreting a well-formed subsequence as being part of the ill-formed subsequence. The expected return value for such a process would instead be <U+FFFD, U+0041, U+0042>.

For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant, but also leaves the converter open to security exploits. See Unicode Technical Report #36, "Unicode Security Considerations."

=== End of quote ===

It then goes on to discuss at length, with examples, the "how many FFFD to emit" issue.

Using their example from the second-to-last quoted paragraph:

>>> bad2 = "\xc2\x41\x42"
>>> bad2.decode('utf8', 'replace')
u'\ufffdB'
# FAIL
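On a fixed Python (2.7.11+ / 3.5.2+), the same input yields the conformant <U+FFFD, U+0041, U+0042> that the standard requires. A Python 3 check:

```python
# C2 promises a 2-byte sequence, but 0x41 is not a continuation byte,
# so only C2 is replaced; the well-formed A and B are kept.
conformant = b'\xc2\x41\x42'.decode('utf-8', 'replace')
print(conformant)  # \ufffdAB
```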

Note that this is a problem with both the 'replace' and 'ignore' options of str.decode('utf_8') -- it's all about omitting data, not about how many U+FFFD are emitted; get the data-emitting part right and the U+FFFD issue falls out naturally, as explained in the part that I didn't quote.

Update 3 Current versions of Python (including 2.7) have unicodedata.unidata_version as '5.1.0' which may or may not indicate that the Unicode-related code is intended to conform to Unicode 5.1.0. In any case, the wordy prohibition of what Python is doing didn't appear in the Unicode standard until 5.2.0. I'll raise an issue on the Python tracker without mentioning the word 'oht'.encode('rot13').

Reported here
