UnicodeDecodeError when performing os.walk


Problem description

I am getting the error:

'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

when trying to do os.walk. The error occurs because some of the filenames in a directory contain the byte 0x8b, which is not valid UTF-8. The files come from a Windows system (hence the utf-16 filenames), but I have copied them over to a Linux system and am using Python 2.7 (running on Linux) to traverse the directories.

I have tried passing a unicode start path to os.walk, and all the files and dirs it generates are unicode names until it comes to a non-utf8 name. Then, for some reason, it doesn't convert those names to unicode, and the code chokes on the utf-16 names. Is there any way to solve the problem short of manually finding and changing all the offending names?

If there is no solution in Python 2.7, can a script be written in Python 3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (by removing the non-utf8 chars)? N.B. there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.

UPDATE: The fact that 0x8b is still only a single-byte char (just not valid ascii) makes it even more puzzling. I have verified that there is a problem converting such a string to unicode, but that a unicode version can be created directly. To wit:

>>> test = 'a string \x8b with non-ascii'
>>> test
'a string \x8b with non-ascii'
>>> unicode(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in range(128)
>>> 
>>> test2 = u'a string \x8b with non-ascii'
>>> test2
u'a string \x8b with non-ascii'

Here's a traceback of the error I am getting:

80.         for root, dirs, files in os.walk(unicode(startpath)):
File "/usr/lib/python2.7/os.py" in walk
294.             for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
294.             for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
284.         if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py" in join
71.             path += '/' + b

Exception Type: UnicodeDecodeError at /admin/casebuilder/company/883/
Exception Value: 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

The root of the problem occurs in the list of files returned from listdir (on line 276 of os.walk):

names = listdir(top)

The names containing bytes above 127 are returned as non-unicode strings (str).

Solution

This problem stems from two fundamental issues. The first is the fact that the Python 2.x default encoding is 'ascii', while the default filesystem encoding on Linux is usually 'utf8'. You can verify these encodings via:

import sys
sys.getdefaultencoding()     # Python's default encoding
sys.getfilesystemencoding()  # encoding the OS uses for filenames

When the os module functions that return directory contents, namely os.walk and os.listdir, are given a unicode path, they return unicode names only for the filenames they can decode with the filesystem encoding. Filenames they cannot decode are returned unchanged as str objects. Therefore, the result is a list containing a mix of unicode and str objects. It is the str objects that can cause problems down the line. Since they are neither ascii nor valid in the filesystem encoding, Python has no way of knowing what encoding to use, and therefore they can't be decoded automatically into unicode.

Therefore, when performing common operations such as os.path.join(dir, file), where dir is unicode and file is an encoded str, Python tries to decode file with the default 'ascii' codec, and the call fails if file contains any non-ascii byte. The solution is to check each filename as soon as it is retrieved and decode the str objects (the encoded ones) to unicode using the appropriate encoding.
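To see that failure in isolation, here is a minimal sketch. It assumes Python 3, where bytes and text cannot be mixed, so the decode that Python 2 performs implicitly during the join is written out explicitly. The helper name describe_decode is hypothetical, and the byte string is the one from the question.

```python
def describe_decode(raw, encoding):
    """Try to decode raw bytes; report the text on success or the offending byte on failure."""
    try:
        return ("ok", raw.decode(encoding))
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded
        return ("error", "byte 0x%02x at position %d" % (raw[exc.start], exc.start))


name = b"a string \x8b with non-ascii"
print(describe_decode(name, "ascii"))          # ('error', 'byte 0x8b at position 9')
print(describe_decode(b"plain.txt", "ascii"))  # ('ok', 'plain.txt')
```

The reported byte and position match the error message in the question, which suggests that the ascii codec, not the filesystem, is what rejects the name.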

That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:

filename.decode('windows-1252')

If a valid unicode version results, you probably have the correct encoding. You can verify further by printing the unicode version and checking that the filename renders correctly.
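The check above can be sketched for several candidate encodings at once. This is a Python 3 illustration under my own assumptions: the helper candidate_decodings and the sample filename are made up for the example. Note that windows-1252 accepts almost every byte value, so a successful decode is weak evidence by itself, and looking at the rendered name remains the real test.

```python
def candidate_decodings(raw, encodings=("utf-8", "windows-1252")):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Hypothetical filename bytes as they might arrive from a Windows machine.
raw_name = b"r\xe9sum\xe9 \x8bdraft\x9b.txt"
for enc, text in candidate_decodings(raw_name).items():
    print(enc, "->", text)
```

For this sample the utf-8 attempt fails and the windows-1252 attempt yields a readable name, which is the pattern you would expect for names created on a Western-locale Windows system.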

One last wrinkle. On a Linux system with files of Windows origin, it is possible, or even probable, to have a mix of windows-1252 and utf8 encodings. There are two ways of dealing with this mixture. The first and preferable one is to run:

$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest

where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing.
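If convmv is not available, the same idea can be sketched in Python 3, which is what the question asks about. This is only an illustration and not how convmv itself is implemented: the function names are hypothetical, and it assumes a POSIX filesystem and that every non-utf8 name is windows-1252. Passing a bytes path makes os.walk return raw bytes names, so nothing is decoded behind your back.

```python
import os


def to_utf8_name(raw, legacy="windows-1252"):
    """Return a UTF-8 version of a raw filename, or None if it cannot be converted."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (or plain ASCII): leave it alone
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode(legacy).encode("utf-8")
    except UnicodeDecodeError:
        return None  # contains a byte the legacy codec does not define


def plan_renames(top):
    """List (old_path, new_path) pairs, deepest entries first. `top` must be bytes."""
    plan = []
    # Walk bottom-up so children are renamed before the directory that contains them.
    for root, dirs, files in os.walk(top, topdown=False):
        for name in dirs + files:
            fixed = to_utf8_name(name)
            if fixed is not None and fixed != name:
                plan.append((os.path.join(root, name), os.path.join(root, fixed)))
    return plan


def apply_renames(plan):
    """Perform the planned renames, skipping any that would overwrite an entry."""
    for old, new in plan:
        if os.path.lexists(new):
            continue
        os.rename(old, new)
```

Calling plan_renames(b"DIRECTORY") first and inspecting the list gives a dry run, much like running convmv without --notest. Only call apply_renames once the plan looks right.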

The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:

def decodeName(name):
    if isinstance(name, str):  # leave unicode ones alone
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError:
            # not valid utf8, so assume a Windows origin
            name = name.decode('windows-1252')
    return name

The function tries a utf8 decoding first. If that fails, it falls back to windows-1252. Use this function after an os call that returns a list of files:

# pass a str (byte string) path so that os.walk returns the raw, undecoded names
for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
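For readers on Python 3, a rough counterpart of the loop above might look like the following. It is a sketch under stated assumptions: the names decode_name and walk_decoded are hypothetical, the fallback encoding is assumed to be windows-1252, and the tree is walked with a bytes path so that every name arrives as raw bytes.

```python
import os


def decode_name(raw, fallback="windows-1252"):
    """Decode a raw filename: UTF-8 first, then the fallback encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps the walk going even for bytes the fallback lacks
        return raw.decode(fallback, errors="replace")


def walk_decoded(top):
    """Like os.walk, but yields text names decoded with decode_name."""
    for root, dirs, files in os.walk(os.fsencode(top)):
        yield (
            decode_name(root),
            [decode_name(d) for d in dirs],
            [decode_name(f) for f in files],
        )
```

The decoded names are meant for display or logging. To open or rename a file, keep using the raw bytes path, because the decoded text may not encode back to the name that is actually on disk.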

I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:

http://farmdev.com/talks/unicode/

I highly recommend it for anyone struggling with unicode issues.
