Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed
Problem description
This code:
import os

for root, dirs, files in os.walk('.'):
    print(root)
Gives me this error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed
How do I walk through a file tree without getting toxic strings like this?
On Linux, filenames are just a sequence of bytes and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings, so the developers came up with a scheme (the surrogateescape error handler) that translates byte strings to Unicode strings and back without loss, and without knowing the original encoding. Each undecodable byte is mapped to a lone surrogate code point, so the byte 0xC3 becomes U+DCC3. The strict UTF-8 encoder used when printing to the terminal refuses to encode lone surrogates, which is what produces the error above.
For example, here's a non-UTF8 byte string:
>>> b'C\xc3N'.decode('utf8','surrogateescape')
'C\udcc3N'
It can be converted to and from Unicode without loss:
>>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
b'C\xc3N'
But it can't be printed:
>>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed
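If the goal is only to see which names are affected, one option is to detect the lone surrogates and escape them before printing. The following is a minimal sketch, not part of the original answer; the helper names `has_escaped_bytes` and `escape_for_print` are hypothetical.

```python
def has_escaped_bytes(name):
    """Return True if name contains lone surrogates (hypothetical helper)."""
    try:
        name.encode('utf-8')  # the strict encoder rejects lone surrogates
    except UnicodeEncodeError:
        return True
    return False


def escape_for_print(name):
    """Return a printable copy of name with surrogates shown as \\udcXX escapes."""
    # 'backslashreplace' turns each unencodable character into an ASCII escape.
    return name.encode('utf-8', 'backslashreplace').decode('utf-8')


name = b'C\xc3N'.decode('utf8', 'surrogateescape')
print(has_escaped_bytes(name))  # True
print(escape_for_print(name))   # C\udcc3N
```

The escaped form is for display or logging only; the unmodified string is still what has to be passed to `open()` or `os.stat()`.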
You'll have to figure out what you want to do with file names that are not in the default encoding. One option is to encode them back to the original bytes and decode those with the 'replace' error handler, which substitutes U+FFFD for each undecodable byte. Use the result for display, but keep the original name to access the file.
>>> b'C\xc3N'.decode('utf8','replace')
'C�N'
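Applied to the original loop, that idea might look like the sketch below. It assumes `os.fsencode()` is an acceptable way to get back to the original bytes (it uses the filesystem encoding with surrogateescape on Linux); `display_name` and `walk_printable` are hypothetical names introduced for illustration.

```python
import os


def display_name(path):
    """Return a printable copy of a str path that may contain lone surrogates."""
    raw = os.fsencode(path)                # back to the original bytes
    return raw.decode('utf-8', 'replace')  # undecodable bytes become U+FFFD


def walk_printable(top='.'):
    """Yield (real_path, printable_path) for every file under top."""
    for root, dirs, files in os.walk(top):
        for name in files:
            full = os.path.join(root, name)
            # Keep `full` for file access; use the second item only for display.
            yield full, display_name(full)


if __name__ == '__main__':
    for real, shown in walk_printable('.'):
        print(shown)
```

The printable form is lossy, so it cannot be converted back to the real file name; that is why the generator yields both values.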
os.walk can also take a byte string and will return byte strings instead of Unicode strings:
for p, d, f in os.walk(b'.'):
    ...  # p is a bytes path; d and f are lists of bytes names
Then you can decode as you like.
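As a rough illustration of decoding the bytes yourself, the sketch below tries a list of candidate encodings and falls back to replacement characters. The candidate list (`'utf-8'`, then `'cp1252'`) is only an assumption about what legacy names might use, and `decode_name` and `walk_bytes` are hypothetical helpers.

```python
import os


def decode_name(raw, encodings=('utf-8', 'cp1252')):
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # No candidate matched: keep what is decodable, mark the rest with U+FFFD.
    return raw.decode('utf-8', 'replace')


def walk_bytes(top=b'.'):
    """Yield (raw_path, decoded_path) for every file under top."""
    for root, dirs, files in os.walk(top):
        for name in files:
            raw_path = os.path.join(root, name)  # bytes in, bytes out
            yield raw_path, decode_name(raw_path)


if __name__ == '__main__':
    for raw_path, shown in walk_bytes(b'.'):
        print(shown)
```

The bytes path can be passed directly to functions such as `open()` and `os.stat()`, so nothing about the original name is lost by decoding only for display.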