Filesystem independent way of using glob.glob and regular expressions with unicode filenames in Python
Problem description
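I am working on a library which I want to keep platform, filesystem and Python 2.x/3.x independent. However, I don't know how to glob for files and match the filenames against regular expressions in a platform/filesystem independent way.

E.g. (on Mac, using IPython, Python 2.7):

    In [7]: from glob import glob
    In [8]: !touch 'ü-0.é'  # Create the file in the current folder
    In [9]: glob(u'ü-*.é')
    Out[9]: []
    In [10]: import unicodedata as U
    In [11]: glob(U.normalize('NFD', u'ü-*.é'))
    Out[11]: [u'u\u0308-0.e\u0301']

However, this doesn't work on Linux or Windows, where I would need unicodedata.normalize('NFC', u'ü-*.é'). The same problem arises when I try to match the filename against a regular expression: only a unicode regular expression normalized as NFD matches the filename on Mac, whereas only an NFC regular expression matches filenames read on Linux/Windows (I use the re.UNICODE flag in both instances).

Is there a standard way of handling this problem?

My hope is that, just as sys.getfilesystemencoding() returns the encoding for the file system, there would exist a function which returns the Unicode normalization used by the underlying filesystem. However, I could find neither such a function nor a safe/standard way to feature-test for it.

Mac + HFS+ uses NFD normalization: https://apple.stackexchange.com/a/10484
Linux + Windows use NFC normalization: http://qerub.se/filenames-and-unicode-normalization-forms
Link to code: https://github.com/musically-ut/seqfile/blob/feat-unicode/seqfile/seqfile.py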
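To make the mismatch concrete, here is a small sketch (not part of the original post) showing that the NFC and NFD spellings of the same filename are different code point sequences. The variable names are illustrative only. Neither glob nor re treats the two as equal until both sides are normalized to the same form.

```python
import unicodedata

# The same visible name written with precomposed characters (NFC form).
nfc_name = u'\xfc-0.\xe9'  # displays as: ü-0.é

# Its decomposed form (NFD): each base letter is followed by a combining mark.
nfd_name = unicodedata.normalize('NFD', nfc_name)

# The two strings look identical when printed but differ as code points.
print(nfc_name == nfd_name)
print(len(nfc_name), len(nfd_name))

# Normalizing both sides to one form makes them comparable again.
print(unicodedata.normalize('NFC', nfd_name) == nfc_name)
```

Which form is chosen matters less than applying the same one to both the pattern and the filename.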
Answer

I'm assuming you want to match unicode-equivalent filenames, e.g. you expect an input pattern of u'\xE9*' to match both filenames u'\xE9qui' and u'e\u0301qui' on any operating system, i.e. character-level pattern matching.

You have to understand that this is not the default on Linux, where bytes are taken as bytes, and where not every filename is a valid unicode string in the current system encoding (although Python 3 uses the 'surrogateescape' error handler to represent these as str anyway).

With that in mind, this is my solution:
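
    import fnmatch
    import os
    import sys
    import unicodedata

    def myglob(pattern, directory=u'.'):
        # Compare the pattern and the names in one normalization form (NFC).
        pattern = unicodedata.normalize('NFC', pattern)
        results = []
        enc = sys.getfilesystemencoding()
        for name in os.listdir(directory):
            # os.listdir can return byte strings (e.g. undecodable names on Python 2).
            if isinstance(name, bytes):
                try:
                    name = name.decode(enc)
                except UnicodeDecodeError:
                    # Filenames that are not proper unicode won't match any pattern
                    continue
            if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
                # Keep the name as the filesystem reported it, not the normalized copy.
                results.append(name)
        return results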
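
The question also asks about regular expressions, and the same idea seems to carry over. The following is a minimal sketch, assuming a hypothetical helper named nfc_match that is not part of the original answer. It normalizes both the regex source and the filename to NFC before matching. One caveat: normalizing the pattern text can change the meaning of a regex that deliberately lists combining marks, for example inside a character class.

```python
import re
import unicodedata


def nfc_match(regex, name, flags=re.UNICODE):
    """Hypothetical helper: match name against regex after NFC-normalizing both."""
    regex = unicodedata.normalize('NFC', regex)
    name = unicodedata.normalize('NFC', name)
    return re.match(regex, name, flags)


# Pattern for names such as 'ü-0.é', written with precomposed characters.
pattern = u'^\xfc-(\\d+)\\.\xe9$'

mac_name = u'u\u0308-0.e\u0301'  # decomposed (NFD), as in the Mac session above
linux_name = u'\xfc-0.\xe9'      # precomposed (NFC)

# Without normalization the NFD name does not match the NFC pattern.
print(re.match(pattern, mac_name, re.UNICODE))

# With normalization both spellings match, and groups work as usual.
print(bool(nfc_match(pattern, mac_name)))
print(bool(nfc_match(pattern, linux_name)))
print(nfc_match(pattern, mac_name).group(1))
```

Note that the match object refers to the normalized copy of the name, so span offsets may not line up with the string as read from disk.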