How to handle undecodable filenames in Python?


Question

I'd really like to have my Python application deal exclusively with Unicode strings internally. This has been going well for me lately, but I've run into an issue with handling paths. The POSIX API for filesystems isn't Unicode, so it's possible (and actually somewhat common) for files to have "undecodable" names: filenames that aren't encoded in the filesystem's stated encoding.

In Python, this manifests as a mixture of unicode and str objects being returned from os.listdir().

>>> os.listdir(u'/path/to/foo')
[u'bar', 'b\xe1z']

In that example, the character '\xe1' is encoded in Latin-1 or somesuch, even when the (hypothetical) filesystem reports sys.getfilesystemencoding() == 'UTF-8' (in UTF-8, that character would be the two bytes '\xc3\xa1'). For this reason, you'll get UnicodeErrors all over the place if you try to use, for example, os.path.join() with Unicode paths, because the filename can't be decoded.
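The mismatch described above can be reproduced without touching a real filesystem. The following is a minimal sketch in Python 3 syntax (the session in the question is Python 2); the helper name `try_decode` is hypothetical and exists only for illustration.

```python
# Python 3 sketch; raw_name is the undecodable name from the listing above.
raw_name = b"b\xe1z"


def try_decode(name: bytes, encoding: str):
    """Return the decoded name, or None if the bytes are invalid in that encoding."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


# In UTF-8, 0xE1 opens a three-byte sequence, but the next byte ('z') is not
# a continuation byte, so strict decoding fails.
as_utf8 = try_decode(raw_name, "utf-8")

# Latin-1 maps every byte to a code point, so the same bytes decode cleanly.
as_latin1 = try_decode(raw_name, "latin-1")

# The UTF-8 encoding of the same character is the two bytes quoted in the question.
utf8_bytes = "\u00e1".encode("utf-8")

print(as_utf8, as_latin1, utf8_bytes)
```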

The Python Unicode HOWTO offers this advice about unicode pathnames:

Note that in most occasions, the Unicode APIs should be used. The bytes APIs should only be used on systems where undecodable file names can be present, i.e. Unix systems.

Because I mainly care about Unix systems, does this mean I should restructure my program to deal only with bytestrings for paths? (If so, how can I maintain Windows compatibility?) Or are there other, better ways of dealing with undecodable filenames? Are they rare enough "in the wild" that I should just ask users to rename their damn files?

(If it is best to just deal with bytestrings internally, I have a followup question: How do I store bytestrings in SQLite for one column while keeping the rest of the data as friendly Unicode strings?)

Solution

Python does have a solution to the problem, if you're willing to switch to Python 3.1 or later:

PEP 383 - Non-decodable Bytes in System Character Interfaces.
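As a rough illustration of what PEP 383 provides, the sketch below applies the `surrogateescape` error handler by hand to the filename from the question. It assumes Python 3.1 or later and assumes UTF-8 as the filesystem encoding; the variable names are illustrative only.

```python
# Python 3.1+ sketch of the PEP 383 "surrogateescape" error handler.
raw_name = b"b\xe1z"  # bytes that are not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN, so the name
# can be carried around as an ordinary str.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text_name.encode("utf-8", "surrogateescape")

# A strict encode still refuses the lone surrogate, so an escaped name cannot
# silently leak into output that must be valid UTF-8.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display or logging, one option is to replace the escaped byte.
printable = text_name.encode("utf-8", "replace").decode("utf-8")

print(repr(text_name), round_trip, strict_ok, printable)
```

On POSIX systems, `os.listdir()` called with a str path in Python 3.1+ returns names decoded this way, and `os.fsencode()` / `os.fsdecode()` (Python 3.2+) perform the same conversion using the filesystem encoding, so the program can keep str paths internally and still open the file.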
