UnicodeDecodeError while processing filenames

Question

I'm using Python 2.7.3 on Ubuntu 12 x64.

I have about 200,000 files in a folder on my filesystem. Some of the file names contain HTML-encoded and escaped characters because the files were originally downloaded from a website. Here are examples:

Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg

I wrote a simple Python script that goes through the folder and renames all of the files with encoded characters in the filename. The new filename is achieved by simply decoding the string that makes up the filename.

The script works for most of the files, but for some of them Python chokes and spits out the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
  File "./download.py", line 53, in downloadGalleries
    numDownloaded = downloadGallery(opener, galleryLink)
  File "./download.py", line 75, in downloadGallery
    filePathPrefix = getFilePath(content)
  File "./download.py", line 90, in getFilePath
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
  File "/home/abc/XYZ/common.py", line 22, in cleanupString
    return HTMLParser.HTMLParser().unescape(string)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)

Here are the contents of my cleanupString function:

def cleanupString(string):
    string = urllib2.unquote(string)

    return HTMLParser.HTMLParser().unescape(string)

And here's the snippet of code that calls the cleanupString function (this is not the same code as in the traceback above, but it produces the same error):

rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []

for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)

        if reobj.match(foundFile):
            imgs.append(foundFile)

for img in imgs :
    print 'Checking file: ' + img
    newImg = cleanupString(img) #Code blows up here for some files

Can anyone provide me with a way to get around this error? I've already tried adding

# -*- coding: utf-8 -*-

to the top of the script but that has no effect.

Thanks.

Recommended answer

Your filenames are byte strings that contain UTF-8 bytes representing Unicode characters. The HTML parser normally works with unicode data rather than byte strings, particularly when it encounters an ampersand escape, so Python automatically tries to decode the value for you, and by default it uses ASCII for that decoding. This fails for UTF-8 data, which contains bytes that fall outside the ASCII range.
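
The decoding failure can be reproduced in isolation. The following is a minimal sketch, not code from the question: the `try_decode` helper and the sample bytes are made up for illustration, and the snippet sticks to syntax that runs on both Python 2.7 and Python 3.

```python
# UTF-8 bytes for one non-ASCII character inside an otherwise ASCII name,
# as they would look after percent-decoding (illustrative sample data).
raw = b'thai_trip_\xe8\xb0\x83.jpg'


def try_decode(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII only covers bytes 0-127, so the 0xe8 byte makes this decode fail.
ascii_result = try_decode(raw, 'ascii')

# The same bytes are valid UTF-8 and decode to a unicode string.
utf8_result = try_decode(raw, 'utf-8')
```

Here `ascii_result` ends up as `None` while `utf8_result` holds the decoded text, which mirrors what happens when Python 2 implicitly decodes a byte string with the ASCII codec.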

You need to explicitly decode your string to a unicode object:

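import HTMLParser
import urllib2

def cleanupString(string):
    # Percent-decode, then decode the resulting UTF-8 bytes to unicode
    string = urllib2.unquote(string).decode('utf8')

    return HTMLParser.HTMLParser().unescape(string)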
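
To see what the decoding version returns, here is a hedged sketch that runs the same three steps (percent-decode, decode as UTF-8, unescape HTML entities) on the sample names from the question. The names `cleanup_string` and `unquote_bytes` are illustrative, the import fallback is an addition so the snippet runs on both Python 2.7 and Python 3.4+, and the third sample name is made up. It assumes the percent-escapes encode UTF-8 and that names are passed in as plain (byte) strings on Python 2.

```python
try:
    # Python 3: unquote_to_bytes returns bytes, html.unescape handles entities
    from urllib.parse import unquote_to_bytes as unquote_bytes
    from html import unescape
except ImportError:
    # Python 2: urllib.unquote returns a byte string for byte-string input
    from urllib import unquote as unquote_bytes
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def cleanup_string(name):
    """Percent-decode `name`, decode the bytes as UTF-8, then unescape entities."""
    text = unquote_bytes(name).decode('utf-8')
    return unescape(text)


samples = [
    'Jamaica%2008%20114.jpg',
    'thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg',
    'Tom%20%26amp%3B%20Jerry.jpg',  # made-up name containing an escaped "&amp;"
]
cleaned = [cleanup_string(name) for name in samples]
```

Each entry in `cleaned` is a unicode string, so the unescape step no longer has to mix byte strings with unicode replacements.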

Your next problem will be that you now have unicode filenames, but your filesystem will need some kind of encoding to work with these filenames. You can check what that encoding is with sys.getfilesystemencoding(); use this to re-encode your filenames:

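import HTMLParser
import sys
import urllib2

def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')
    # Re-encode the unicode result so it can be used as a filesystem path
    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())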
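
Putting the re-encoding step to work in an actual rename pass might look like the sketch below. This is an illustration under stated assumptions, not the asker's script: `rename_decoded` is a hypothetical helper, it takes the folder as a byte-string path, it only percent-decodes (HTML unescaping is left out for brevity), it falls back to UTF-8 if `sys.getfilesystemencoding()` returns nothing, and it skips names it cannot safely rename.

```python
import os
import sys

try:
    from urllib.parse import unquote_to_bytes as unquote_bytes  # Python 3
except ImportError:
    from urllib import unquote as unquote_bytes  # Python 2, bytes in and out


def rename_decoded(folder):
    """Percent-decode the filenames in `folder` (a byte-string path).

    Returns the list of (old_name, new_name) byte-string pairs renamed.
    """
    # Assumption: fall back to UTF-8 when no filesystem encoding is reported.
    fs_encoding = sys.getfilesystemencoding() or 'utf-8'
    renamed = []
    for old_name in os.listdir(folder):
        try:
            text = unquote_bytes(old_name).decode('utf-8')
            new_name = text.encode(fs_encoding)
        except UnicodeError:
            continue  # not valid UTF-8, or not representable on this filesystem
        target = os.path.join(folder, new_name)
        # Skip unchanged names, names that would gain a path separator,
        # and names that would overwrite an existing file.
        if new_name == old_name or b'/' in new_name or os.path.exists(target):
            continue
        os.rename(os.path.join(folder, old_name), target)
        renamed.append((old_name, new_name))
    return renamed
```

Working with byte-string paths throughout keeps the behavior the same on Python 2 and Python 3 on a POSIX system; the skip rules are a conservative design choice, since a decoded `%2F` would otherwise turn into a directory separator.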

You can read up on how Python deals with Unicode in the Unicode HOWTO.
