"surrogateescape" cannot escape certain characters


Question

Regarding reading and writing text files in Python, one of the main Python contributors says this about the surrogateescape Unicode error handler:

[surrogateescape] handles decoding errors by squirreling the data away in a little used part of the Unicode code point space. When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly.

However, when I open a file and then try to write its contents to another file:

input_file = open('someFile.txt', 'r', encoding="ascii", errors="surrogateescape")
output_file = open('anotherFile.txt', 'w')

for line in input_file:
    output_file.write(line)

Result:

  File "./break-50000.py", line 37, in main
    output_file.write(line)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed

Note that the input file is not ASCII. However, the script gets through hundreds of lines that contain non-ASCII characters just fine before it throws the exception on one particular line. The output file must be ASCII, and losing some characters is just fine.

This is the line that throws the error, shown here decoded as UTF-8:

'Zoë\'s Coffee House'

This is its hex encoding:

$ cat z.txt | hd
00000000  27 5a 6f c3 ab 5c 27 73  20 43 6f 66 66 65 65 20  |'Zo..\'s Coffee |
00000010  48 6f 75 73 65 27 0a                              |House'.|
00000017

Why might the surrogateescape Unicode error handler be returning a character that is not ASCII? This is with Python 3.2.3 on Kubuntu Linux 12.10.

Recommended answer

Why might the surrogateescape Unicode error handler be returning a character that is not ASCII?

Because that is exactly what it is designed to do. Each byte that fails to decode is mapped to a lone surrogate code point in the range U+DC80 to U+DCFF, so when you encode with the same error handler it knows how to turn those code points back into the original bytes.

>>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
"'Zo\udcc3\udcab\\'s'"
>>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
b"'Zo\xc3\xab\\'s'"
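
In the question's code the output file is opened without an encoding or an error handler, so (judging by the traceback) it falls back to strict UTF-8, which refuses lone surrogates. Since the output has to be ASCII and losing characters is acceptable, one possible fix is to give the output file its own encoding and error handler. The following is only a sketch under that assumption: `copy_as_ascii` is a hypothetical helper name, and the temporary files stand in for `someFile.txt` and `anotherFile.txt`.

```python
import os
import tempfile

# The problem line from the question, as the raw bytes shown in the hex dump.
raw = b"'Zo\xc3\xab\\'s Coffee House'\n"

# Strict UTF-8 (what the traceback reports) rejects an escaped byte.
try:
    "\udcc3".encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason


def copy_as_ascii(src_path, dst_path, on_error="replace"):
    """Hypothetical helper: copy a text file, forcing pure ASCII output.

    on_error="replace" writes '?' for each undecodable input byte,
    on_error="ignore" silently drops it.
    """
    # newline="" keeps line endings unchanged on every platform.
    with open(src_path, "r", encoding="ascii",
              errors="surrogateescape", newline="") as src, \
         open(dst_path, "w", encoding="ascii",
              errors=on_error, newline="") as dst:
        for line in src:
            dst.write(line)


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "someFile.txt")
    dst_path = os.path.join(tmp, "anotherFile.txt")
    with open(src_path, "wb") as f:
        f.write(raw)

    copy_as_ascii(src_path, dst_path, on_error="replace")
    with open(dst_path, "rb") as f:
        replaced = f.read()

    copy_as_ascii(src_path, dst_path, on_error="ignore")
    with open(dst_path, "rb") as f:
        dropped = f.read()

print(strict_error)
print(replaced)
print(dropped)
```

If the goal were instead to reproduce the input bytes exactly, opening the output with `encoding="ascii", errors="surrogateescape"` would round-trip them, but the result would then no longer be pure ASCII.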
