Is it possible to construct a unicode string that the utf-8 codec cannot encode?


Question


Is it possible to construct a unicode string that the utf-8 codec cannot encode?

From the documentation (https://docs.python.org/2/library/codecs.html), it appears that the utf-8 codec can encode a symbol in "any language". The docs also note when a codec can only encode certain characters or only the Basic Multilingual Plane. I don't know whether this is equivalent to saying "it is impossible to construct a unicode value that cannot be converted to a bytestring using the utf-8 codec", however.

Here's the table entry for the utf-8 codec.

Codec    Aliases          Purpose
utf_8    U8, UTF, utf8    all languages

The motivation here is that I have a utility function that takes either a unicode string or a byte string and converts it to a byte string. When given a byte string it is a no-op. This function is not supposed to throw an exception unless it is called with a non-string type and in that case it's supposed to fail informatively with a TypeError that will be caught later and logged. (We can still run into problems if the repr of the item we attempted to insert into the exception message is too big, but let's ignore that for now).

I'm using the strict setting because I want this function to throw an exception in the event that it encounters a unicode object that it cannot encode, but am hoping that that isn't possible.

import codecs

def utf8_to_bytes(item):
    """take a bytes or unicode object and convert it to bytes,
    using utf-8 if necessary"""
    if isinstance(item, bytes):
        return item
    elif isinstance(item, unicode):
        return codecs.encode(item, 'utf-8', 'strict')
    else:
        raise TypeError("item must be bytes or unicode. got %r" % type(item))
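
For concreteness, the expected behaviour looks like this (an illustrative Python 2 session; the sample values are made up):

>>> utf8_to_bytes(u'caf\xe9')
'caf\xc3\xa9'
>>> utf8_to_bytes(b'caf\xc3\xa9')
'caf\xc3\xa9'
>>> utf8_to_bytes(42)
Traceback (most recent call last):
  ...
TypeError: item must be bytes or unicode. got <type 'int'>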

Solution

UTF-8 is designed to encode all of the Unicode standard. Encoding Unicode text to UTF-8 will not normally throw an exception.

From the Wikipedia article on the codec:

UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode

The Python 2 UTF-8 encoding has no edge-cases that I know of; non-BMP data and surrogate pairs are handled just the same:

>>> import sys
>>> hex(sys.maxunicode)  # a narrow UCS-2 build
'0xffff'
>>> len(u'\U0001F525')
2
>>> u'\U0001F525'.encode('utf-8')
'\xf0\x9f\x94\xa5'
>>> u'\ud83d\udd25'
u'\U0001f525'
>>> len(u'\ud83d\udd25')
2
>>> u'\ud83d\udd25'.encode('utf-8')
'\xf0\x9f\x94\xa5'

Note that strict is the default error handling mode for encoding. You don't need to use the codecs module either; just call the encode method on the unicode object directly:

return item.encode('utf-8')
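
As a quick sanity check (illustrative; any string works here), the two spellings produce identical results:

>>> import codecs
>>> u'caf\xe9'.encode('utf-8') == codecs.encode(u'caf\xe9', 'utf-8', 'strict')
True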

In Python 3, the situation is slightly more complicated. Encoding and decoding of surrogates is restricted; the official standard states that such code points may only appear in UTF-16 encoded data, and then only as a high and low surrogate pair.

As such, you need to explicitly state that you want to support such codepoints with the surrogatepass error handler:

Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
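
For example (a quick illustration in a Python 3 interpreter; the lone surrogate here is just a made-up sample value):

>>> '\ud83d'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> '\ud83d'.encode('utf-8', 'surrogatepass')
b'\xed\xa0\xbd'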

The only difference between surrogatepass and strict is that surrogatepass will also let you encode surrogate code points in your Unicode text to UTF-8. You'd only get such data in rare circumstances: when the surrogates were written as literals, or when bytes containing unpaired surrogate code points were decoded with surrogatepass.

So, in Python 3, only if there is a chance that your Unicode text was produced by a surrogatepass decode or from literal data do you need to use item.encode('utf8', 'surrogatepass') to be absolutely certain that all possible Unicode values can be encoded.
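
Putting that together, a minimal Python 3 sketch of the same helper might look like this (a hypothetical adaptation, not code from the question; surrogatepass is only needed if lone surrogates can occur in the input):

def utf8_to_bytes(item):
    """Take a bytes or str object and convert it to bytes,
    encoding with UTF-8 if necessary."""
    if isinstance(item, bytes):
        # already bytes: no-op
        return item
    elif isinstance(item, str):
        # 'surrogatepass' only matters if the text may contain lone surrogates;
        # the default 'strict' handler never fails on any other code point
        return item.encode('utf-8', 'surrogatepass')
    else:
        raise TypeError("item must be bytes or str. got %r" % type(item))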
