UTF-8 and os.listdir()

Question

I'm having a bit of trouble with a file containing the "ș" character (that's \xC8\x99 in UTF-8 - LATIN SMALL LETTER S WITH COMMA BELOW).

I'm creating a ș.txt file and trying to get it back with os.listdir(). Unfortunately, os.listdir() returns it back as s\xCC\xA6 ("s" + COMBINING COMMA BELOW) and my test program (below) fails.

This happens on my OS X, but it works on a Linux machine. Any idea what exactly causes this behavior (both environments are configured with LANG=en_US.UTF8) ?

Here is the test program:

#coding: utf-8
import os

fname = "ș.txt"
with open(fname, "w") as f:
    f.write("hi")

files = os.listdir(".")
print "fname: ", fname
print "files: ", files

if fname in files:
    print "found"
else:
    print "not found"

Recommended answer

The OS X filesystem mostly uses decomposed characters rather than their combined form. You'll need to normalise the filenames back to the NFC combined normalised form:

import os
import unicodedata
files = [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]
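To see why the membership test in the question fails, here is a minimal sketch comparing the two spellings of the filename directly. The variable names are illustrative only, and the strings are written with escape sequences so the example does not depend on the source file's encoding:

```python
import unicodedata

# Precomposed form: U+0219 LATIN SMALL LETTER S WITH COMMA BELOW
composed = u"\u0219.txt"
# Decomposed form: "s" followed by U+0326 COMBINING COMMA BELOW,
# which is how os.listdir() reported the name on OS X in the question
decomposed = u"s\u0326.txt"

# The two strings render identically but are different code point sequences.
same_raw = composed == decomposed

# After normalising to NFC, the decomposed name matches the composed one.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(same_raw, same_nfc)
```

Normalising both sides of a comparison to the same form (NFC here) is what makes the lookup reliable across platforms.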

Passing a unicode path (u'.') makes os.listdir() return the filenames as unicode; with a bytestring path you would need to decode each name to unicode before normalising it.
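If the names arrive as bytestrings, the decode step might look like the sketch below. The helper normalize_name is hypothetical (it is not part of os or unicodedata), and it assumes the filesystem encoding is UTF-8, as in the question's environment:

```python
import unicodedata

def normalize_name(name, encoding="utf-8"):
    """Hypothetical helper: return a filename as NFC-normalised unicode text.

    Assumes bytestring names are encoded as UTF-8 unless told otherwise.
    """
    if isinstance(name, bytes):
        # Decode first: normalize() only accepts unicode text.
        name = name.decode(encoding)
    return unicodedata.normalize("NFC", name)

# The decomposed bytes from the question: "s" + \xCC\xA6 (COMBINING COMMA BELOW)
raw = b"s\xcc\xa6.txt"
fixed = normalize_name(raw)
print(repr(fixed))
```

Note that in the question's Python 2 test program, fname is itself a bytestring literal, so it would also need to be unicode (for example u"ș.txt") for the comparison against the normalised list to succeed.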

Also see the unicodedata.normalize() function documentation.
