Unicode encoding for filesystem in Mac OS X not correct in Python?


Problem description

I'm struggling a bit with Unicode file names in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used in the filenames seems to be different from what sys.getfilesystemencoding() tells me. Take the following code:

#!/usr/bin/env python
# coding=utf-8

import sys,os
print sys.getfilesystemencoding()

p = u'/temp/s/'
s = u'åäö'
print 's', [ord(c) for c in s], s
s2 = s.encode(sys.getfilesystemencoding())
print 's2', [ord(c) for c in s2], s2
os.mkdir(p+s)
for d in os.listdir(p):
  print 'dir', [ord(c) for c in d], d

It outputs the following:

utf-8
s [229, 228, 246] åäö
s2 [195, 165, 195, 164, 195, 182] åäö
dir [97, 778, 97, 776, 111, 776] åäö

So the file system encoding is utf-8, but when I encode my filename åäö with it, the result is not the same as what I get back after creating a directory with that same string. I expected that when I use my string åäö to create a directory and read its name back, it would use the same codes as if I had applied the encoding directly.

If we look at the code points 97, 778, 97, 776, 111, 776, they are basically ASCII characters each followed by a combining diacritic, e.g. o + ¨ = ö, which makes two characters per letter, not one. How can I avoid this discrepancy? Is there an encoding scheme in Python that matches this behaviour of OS X, and why isn't getfilesystemencoding() giving me the right result?

Or have I messed up?

Solution

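Mac OS X uses a decomposed form of Unicode (close to NFD) for filenames, so an accented letter comes back as a base letter followed by a combining mark. The encoding reported by getfilesystemencoding() is still correct: both strings are valid UTF-8, they just differ in normalization. If you need to e.g. read in filenames and write them to a "normal" UTF-8 file, you must normalize them (here filename is a UTF-8 encoded byte string in Python 2):

import unicodedata
filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')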

from here: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode
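
As a rough illustration in Python 3 (the code above is Python 2), the sketch below reproduces the code points from the question with unicodedata.normalize and shows why a regular expression written with precomposed characters only matches after normalizing to NFC. The helper names nfc and listdir_nfc are made up for this example and are not part of any library. It assumes that os.listdir() returns str, which is the case in Python 3 when it is given a str path.

```python
import os
import re
import unicodedata


def nfc(name):
    """Return name with combining sequences composed (Unicode NFC)."""
    return unicodedata.normalize("NFC", name)


def listdir_nfc(path):
    """Hypothetical helper: os.listdir() with every name normalized to NFC."""
    return [nfc(name) for name in os.listdir(path)]


composed = "\u00e5\u00e4\u00f6"                      # 'åäö' as three precomposed code points
decomposed = unicodedata.normalize("NFD", composed)  # the form seen in the listdir() output

print([ord(c) for c in composed])    # [229, 228, 246]
print([ord(c) for c in decomposed])  # [97, 778, 97, 776, 111, 776]
print(nfc(decomposed) == composed)   # True

# A pattern written with precomposed characters only matches after normalizing.
pattern = re.compile("^\u00e5\u00e4\u00f6$")
print(bool(pattern.match(decomposed)))       # False
print(bool(pattern.match(nfc(decomposed))))  # True
```

Normalizing names once, right where they are read (as listdir_nfc does), keeps the rest of the code free of normalization concerns. If a name has to be passed back to the file system, it may be safer to keep the original string returned by os.listdir() alongside the normalized one.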
