Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed


Problem description

This code:

for root, dirs, files in os.walk('.'):
    print(root)

Gives me this error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed

How do I walk through a file tree without getting toxic strings like this?

Solution

On Linux, filenames are 'just a bunch of bytes' and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings. To do that, the developers came up with a scheme (the surrogateescape error handler) that translates byte strings to Unicode strings and back without loss, and without knowing the original encoding. Each 'bad' byte is mapped to a lone surrogate code point in the range U+DC80 to U+DCFF, but the normal UTF-8 encoder can't handle those when printing to the terminal.

For example, here's a non-UTF8 byte string:

>>> b'C\xc3N'.decode('utf8','surrogateescape')
'C\udcc3N'

It can be converted to and from Unicode without loss:

>>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
b'C\xc3N'

But it can't be printed:

>>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed
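If the goal is only to see which names are affected, one possibility (not part of the original answer) is to encode with the standard 'backslashreplace' error handler, which writes each lone surrogate as a visible escape instead of raising. The helper name printable below is hypothetical; this is a sketch, not the only way to do it.

```python
def printable(name):
    """Return a copy of name that is safe to print.

    Hypothetical helper: lone surrogates such as '\\udcc3' are turned
    into the literal text \\udcc3 by the 'backslashreplace' handler.
    """
    return name.encode('utf-8', 'backslashreplace').decode('utf-8')


if __name__ == '__main__':
    bad = b'C\xc3N'.decode('utf8', 'surrogateescape')
    print(printable(bad))  # no UnicodeEncodeError here
```

The result is for display only; it is no longer a valid name for opening the file.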

You'll have to decide what you want to do with file names that are not in the default encoding. One option is to encode them back to the original bytes and then decode those bytes with the 'replace' error handler, which substitutes a replacement character for each undecodable byte. Use the result for display, but keep the original name to access the file.

>>> b'C\xc3N'.decode('utf8','replace')
'C�N'
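Applied to the loop from the question, that idea might look like the sketch below. It assumes a POSIX system where os.fsencode() reverses the surrogateescape decoding that os.walk() applied; the names display_name and walk_printable are made up for illustration.

```python
import os


def display_name(name):
    """Return a printable version of a str path produced by os.walk()."""
    raw = os.fsencode(name)                # recover the original bytes
    return raw.decode('utf-8', 'replace')  # bad bytes become U+FFFD


def walk_printable(top='.'):
    """Yield (original, printable) pairs for every directory under top."""
    for root, dirs, files in os.walk(top):
        # Keep root for file access; use the second item only for display.
        yield root, display_name(root)


if __name__ == '__main__':
    for original, shown in walk_printable('.'):
        print(shown)
```

The original string is still what you pass to open() or os.stat(); the printable copy may not match any real file name.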

os.walk can also take a byte string and will return byte strings instead of Unicode strings:

for p, d, f in os.walk(b'.'):
    print(p)  # p is bytes; d and f are lists of bytes

Then you can decode as you like.
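As a rough sketch of "decode as you like", the code below walks the tree in bytes mode and tries a list of candidate encodings for each file name. The candidate list ('utf-8', then 'latin-1') is only an assumption about what the names might contain, and decode_name and walk_bytes are hypothetical helper names.

```python
import os


def decode_name(raw, encodings=('utf-8', 'latin-1')):
    """Decode a bytes file name with the first encoding that succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # None of the candidates worked: fall back to replacement characters.
    return raw.decode('utf-8', 'replace')


def walk_bytes(top=b'.'):
    """Yield (bytes_path, decoded_name) for every file under top."""
    for root, dirs, files in os.walk(top):
        for name in files:
            # root and name are bytes, so the joined path is bytes too.
            yield os.path.join(root, name), decode_name(name)


if __name__ == '__main__':
    for path, shown in walk_bytes(b'.'):
        print(shown)
```

Because 'latin-1' maps every byte to a character, it never fails, so it belongs last in the list; the bytes path is the one to use for opening the file.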
