How to decode an ASCII string with backslash x (\x) codes


Problem description

I am trying to decode the following from a Brazilian Portuguese text:

'Demais Subfun\xc3\xa7\xc3\xb5es 12'

It should be

'Demais Subfunções 12'

>> a.decode('unicode_escape')
>> a.encode('unicode_escape')
>> a.decode('ascii')
>> a.encode('ascii')

all give:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13:
ordinal not in range(128)

on the other hand this gives:

>> print a.encode('utf-8')
Demais Subfun├â┬º├â┬Áes 12

>> print a
Demais Subfunções 12

Solution

You have binary data that is not ASCII encoded. The \xhh escape sequences indicate that your data is encoded with a different codec. What you are seeing is the representation Python produces with the repr() function: a string that can be re-used as a Python literal to re-create exactly the same value. This representation is very useful when debugging a program.

In other words, the \xhh escape sequences represent individual bytes, and the hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5, that do not map to printable ASCII characters so Python uses the \xhh notation instead.
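The point about individual bytes can be checked directly. The following is a minimal sketch written for Python 3, which is an assumption: the session in this answer is Python 2, where a plain string literal is already a byte string. In Python 3 the same data needs the b'' prefix. The names data and non_ascii are illustrative only.

```python
import ast

# Python 3 sketch: a bytes literal holds raw bytes, and each \xhh spells one byte in hex.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Collect the hex value of every byte outside the ASCII range (>= 0x80).
non_ascii = [format(byte, '02X') for byte in data if byte >= 0x80]
print(non_ascii)   # ['C3', 'A7', 'C3', 'B5']

# The first non-ASCII byte sits at index 13, the position named in the error message.
print(data.index(b'\xc3'))   # 13

# repr() produces a literal that evaluates back to the same value.
round_tripped = ast.literal_eval(repr(data))
print(round_tripped == data)   # True
```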

What you have is UTF-8 data, so decode it as such:

>>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
u'Demais Subfun\xe7\xf5es 12'
>>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
Demais Subfunções 12

The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
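The byte-pair-to-code-point mapping described above can be confirmed with the standard unicodedata module. This is a Python 3 sketch (an assumption, since the answer's session uses Python 2), and the variable names are illustrative.

```python
import unicodedata

# Python 3 sketch: decode the UTF-8 bytes into a str.
raw = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'
text = raw.decode('utf-8')
print(text)

# Decode each two-byte sequence on its own and look up the resulting code point.
described = []
for pair in (b'\xc3\xa7', b'\xc3\xb5'):
    char = pair.decode('utf-8')
    described.append('U+{:04X} {}'.format(ord(char), unicodedata.name(char)))

for line in described:
    print(line)
# U+00E7 LATIN SMALL LETTER C WITH CEDILLA
# U+00F5 LATIN SMALL LETTER O WITH TILDE
```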

ASCII happens to be a subset of the UTF-8 codec, which is why all the other letters appear as plain ASCII characters in the Python repr() output.
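The subset relationship can be demonstrated by decoding the same bytes with both codecs. This is a Python 3 sketch (an assumption; the answer itself uses Python 2), and the names are illustrative.

```python
# Python 3 sketch: every byte in the ASCII range (0x00-0x7F) decodes to the
# same character under both the 'ascii' and the 'utf-8' codec.
same = all(
    bytes([value]).decode('ascii') == bytes([value]).decode('utf-8')
    for value in range(128)
)
print(same)   # True

# Bytes at or above 0x80 are not valid ASCII, which is what triggers the
# error shown in the question.
reason = None
try:
    b'\xc3\xa7'.decode('ascii')
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)   # ordinal not in range(128)
```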
