Filesystem independent way of using glob.glob and regular expressions with unicode filenames in Python


Problem description

I am working on a library which I want to keep platform, filesystem and Python 2.x/3.x independent. However, I don't know how to glob for files and match the filenames against regular expressions in a platform/filesystem independent way.

E.g. (on Mac, using IPython, Python 2.7):

   In[7]: from glob import glob
   In[8]: !touch 'ü-0.é' # Create the file in the current folder

   In[9]: glob(u'ü-*.é')
  Out[9]: []

   In[10]: import unicodedata as U

   In[11]: glob(U.normalize('NFD', u'ü-*.é'))
  Out[11]: [u'u\u0308-0.e\u0301']

However, this doesn't work on Linux or Windows, where I would need unicodedata.normalize('NFC', u'ü-*.é'). The same problem arises when I try to match the filename against a regular expression: on Mac, only a unicode regular expression normalized as NFD matches the filename, whereas on Linux/Windows only an NFC regular expression matches the filenames read from disk (I use the re.UNICODE flag in both instances).

Is there a standard way of handling this problem?

My hope is that just like sys.getfilesystemencoding() returns the encoding for the file system, there would exist a function which returns the Unicode normalization used by the underlying filesystem.

However, I could find neither such a function nor a safe/standard way to feature-test for it.


Mac + HFS+ uses NFD normalization: https://apple.stackexchange.com/a/10484

Linux + Windows filenames are usually NFC (these systems generally store names as given rather than normalizing them): http://qerub.se/filenames-and-unicode-normalization-forms

Link to code: https://github.com/musically-ut/seqfile/blob/feat-unicode/seqfile/seqfile.py

Solution

I'm assuming you want to match unicode equivalent filenames, e.g. you expect an input pattern of u'\xE9*' to match both filenames u'\xE9qui' and u'e\u0301qui' on any operating system, i.e. character-level pattern matching.
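
To make the notion of "unicode equivalent" concrete, here is a minimal sketch using only the standard `unicodedata` module. The variable names are illustrative and not part of the original answer; the two strings are the ones mentioned above.

```python
import unicodedata

composed = u'\xe9qui'        # 'é' as a single code point (NFC form)
decomposed = u'e\u0301qui'   # 'e' followed by a combining acute accent (NFD form)

# The two strings differ code point by code point ...
raw_equal = composed == decomposed

# ... but compare equal once both are brought to the same normalization form.
nfc_equal = (unicodedata.normalize('NFC', composed)
             == unicodedata.normalize('NFC', decomposed))

print(raw_equal, nfc_equal)
```

This is why a pattern typed in one form can silently miss a filename stored in the other: the comparison happens on code points, not on what the user sees.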

You have to understand that this is not the default on Linux, where bytes are taken as bytes, and where not every filename is a valid unicode string in the current system encoding (although Python 3 uses the 'surrogateescape' error handler to represent these as str anyway).
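
As a rough illustration of the 'surrogateescape' behaviour mentioned above (Python 3 only), the sketch below decodes a byte string that is not valid UTF-8. The sample bytes are made up for the example, and UTF-8 is assumed as the filesystem encoding.

```python
raw_name = b'caf\xe9'  # Latin-1 bytes for "café"; not valid UTF-8

# 'surrogateescape' maps each undecodable byte to a lone surrogate
# in the range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
as_text = raw_name.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode('utf-8', 'surrogateescape')

print(repr(as_text), round_tripped == raw_name)
```

Such a name is a `str`, but it is not meaningful text, so it should not be expected to match a character-level pattern.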

With that in mind, this is my solution:

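import fnmatch
import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    # Compare everything in NFC so equivalent spellings match each other
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        # os.listdir may return byte strings (e.g. on Python 2)
        if isinstance(name, bytes):
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match any pattern
                continue
        if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
            # Keep the name as the filesystem reported it
            results.append(name)
    return results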
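
The same idea can plausibly be carried over to the regular-expression half of the question. The following is a sketch, not part of the original answer: `nfc_regex_filter` is a hypothetical helper name, and it assumes the names have already been decoded to text (for example with the decoding step used in `myglob`).

```python
import re
import unicodedata


def nfc_regex_filter(pattern, names):
    """Return the names whose NFC form matches the NFC-normalized pattern."""
    regex = re.compile(unicodedata.normalize('NFC', pattern), re.UNICODE)
    return [name for name in names
            if regex.match(unicodedata.normalize('NFC', name))]


# Sample names: the first two spell "équi" in NFC and NFD respectively.
candidates = [u'\xe9qui-0.txt', u'e\u0301qui-1.txt', u'equi-2.txt']
matched = nfc_regex_filter(u'\xe9qui-\\d+\\.txt$', candidates)
print(matched)
```

NFC is used here rather than NFD because it keeps most accented letters as single code points, which tends to behave better inside character classes and under quantifiers. In real use the `names` argument could be the result of `os.listdir(directory)`; the original names are returned so they can still be opened on the filesystem that reported them.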
