File open error when using the utf-8 codec in Python
Question
I run the following code on Windows XP with Python 2.6.4, but it raises an IOError.
How can I open a file whose name is encoded as UTF-8?
>>> open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
open( unicode('한글.txt', 'euc-kr').encode('utf-8') )
IOError: [Errno 22] invalid mode ('r') or filename: '\xed\x95\x9c\xea\xb8\x80.txt'
But the following code works normally.
>>> open( unicode('한글.txt', 'euc-kr') )
<open file u'\ud55c\uae00.txt', mode 'r' at 0x01DD63E0>
Recommended answer
The C runtime interface that Windows exposes to Python uses the system code page to encode filenames. Unlike on OS X and modern Linux versions, on Windows the system code page is never UTF-8. So the UTF-8 byte string won't be any good.
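To make the mismatch concrete, here is a small sketch comparing the bytes that the two encodings produce for the filename in the question. It assumes the system code page is the Korean one (cp949); the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative only: compare the bytes produced by two encodings of one name.
name = u'\ud55c\uae00.txt'  # u'한글.txt', the filename from the question

utf8_bytes = name.encode('utf-8')    # what the failing call passed to open()
cp949_bytes = name.encode('cp949')   # what a Korean system code page would use

print(repr(utf8_bytes))
print(repr(cp949_bytes))
```

The UTF-8 bytes are the ones shown in the IOError message above. Read through the system code page, they spell a different (and here invalid) filename.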
You could encode the filename to the current code page using .encode('mbcs'), which in your case is probably equivalent to .encode('cp949'). To make it compatible with other platforms where filenames are UTF-8, you could look up sys.getfilesystemencoding(), which will give you utf-8 there or mbcs on Windows.
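A minimal sketch of that lookup might look like the following. The helper name encode_filename is hypothetical, and the fallback to 'utf-8' when no encoding is reported is my own assumption rather than something the answer specifies.

```python
import sys

def encode_filename(name):
    """Encode a Unicode filename to the byte encoding the platform reports.

    Hypothetical helper: falls back to UTF-8 if the platform reports nothing.
    """
    encoding = sys.getfilesystemencoding() or 'utf-8'
    return name.encode(encoding)

# An ASCII-only name encodes the same way under any of these encodings.
print(repr(encode_filename(u'report.txt')))
```

On Windows this would still raise UnicodeEncodeError for characters that the system code page cannot represent.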
However, whilst cp949 would work for Korean characters, it would break on anything outside the repertoire of that code page (an extended version of EUC-KR).
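As a rough illustration of that limitation, the sketch below tries to encode a Korean name and a Hebrew name with cp949. The Hebrew name is just an arbitrary example of text outside the code page, and can_encode is a hypothetical helper.

```python
korean = u'\ud55c\uae00.txt'              # Hangul, covered by cp949
hebrew = u'\u05e9\u05dc\u05d5\u05dd.txt'  # Hebrew, not covered by cp949

def can_encode(name, encoding):
    """Return True if every character of name exists in the given encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(can_encode(korean, 'cp949'))
print(can_encode(hebrew, 'cp949'))
```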
So another approach is to keep your filenames as Unicode. On Windows this will use the Unicode-native interfaces to pass filenames to Windows in the UTF-16LE encoding it uses internally. (See PEP 277 for more on this feature.)
This generally works on other platforms too: Linux and OS X should silently encode the Unicode filenames to UTF-8 for you. It may fail more often in older Python versions, but it's the default way to handle filenames in Python 3 (where the default string type has changed to Unicode).
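A sketch of the Unicode-filename approach, under a few assumptions: the temporary directory has an ASCII-only path, the filesystem can store the name, and on Linux the locale's filesystem encoding is UTF-8. The function name is hypothetical, and io.open is used only so the same code runs on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

def roundtrip_unicode_filename(name, text):
    """Write and re-read a file whose name is passed as a Unicode string."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, name)  # stays Unicode throughout
    try:
        with io.open(path, 'w', encoding='utf-8') as handle:
            handle.write(text)
        with io.open(path, 'r', encoding='utf-8') as handle:
            return handle.read()
    finally:
        # Clean up via the Unicode path so no byte-string names are involved.
        if os.path.exists(path):
            os.remove(path)
        os.rmdir(directory)

print(roundtrip_unicode_filename(u'\ud55c\uae00.txt', u'hello'))
```

No explicit .encode() call appears anywhere: the platform layer chooses the byte representation of the name.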
The traps to watch out for when using Unicode filenames on Python 2 are:
- if os.path.supports_unicode_filenames is False, as it will be outside Windows, the functions that return filenames, such as os.listdir, will always give you byte strings. You'd have to detect that and decode them using sys.getfilesystemencoding().
- if you have a file on Linux/OS X with a name that's not a valid UTF-8 string, you won't be able to get a Unicode filename for it (UnicodeDecodeError if you try). Bit of a corner case, but it can lead to annoying inaccessible files.
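Both traps can be handled in one place. The following is a minimal sketch, assuming you want to separate names that decode cleanly from names that do not; decode_filenames and its encoding parameter are hypothetical, not part of the standard library.

```python
import sys

def decode_filenames(names, encoding=None):
    """Split names into (decoded Unicode names, undecodable byte names).

    Hypothetical helper: byte strings are decoded with the filesystem
    encoding unless another encoding is given; Unicode names pass through.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding() or 'utf-8'
    decoded, undecodable = [], []
    for name in names:
        if isinstance(name, bytes):
            try:
                name = name.decode(encoding)
            except UnicodeDecodeError:
                # Keep the raw bytes so the file can still be opened by them.
                undecodable.append(name)
                continue
        decoded.append(name)
    return decoded, undecodable

good, bad = decode_filenames([b'plain.txt', b'\xff.txt', u'x.txt'], 'utf-8')
print(good)
print(bad)
```

In practice the input would be the result of os.listdir. Keeping the undecodable names as bytes leaves those files reachable, since the byte string can still be passed back to open on Linux and OS X.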
Incidentally, in
open(unicode('한글.txt', 'euc-kr'))
you would probably want to say 'cp949' rather than 'euc-kr' (as the Windows Korean code page has minor differences from EUC-KR). Or, more generally, 'mbcs', which gives you the system code page, presumably the same one your console uses for typed input. Anyway, I don't know about PyShell, but normally if the above works then you should just be able to type it directly:
open(u'한글')