Handling Unicode filenames in Python 3.4 on Windows

Question

I'm trying to find a reliable way to scan files on Windows in Python, while allowing for the possibility that the filenames contain various Unicode code points. I've seen several proposed solutions to this problem, but none of them works for all of the actual issues that I've encountered in scanning filenames created by real-world software and users.

The code sample below is an attempt to isolate and demonstrate the core issue. It creates three files in a subfolder with the sorts of variations I've encountered, and then attempts to scan through that folder and display each filename followed by the file's contents. It crashes on the attempt to read the third test file, with OSError [Errno 22] Invalid argument.

import os

# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.getcwd() + '\\temp'
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
print('file contents', file=open('temp/simple.txt','w'))
print('file contents', file=open('temp/with a ® symbol.txt','w'))
print('file contents', file=open('temp/with these chars ΣΑΠΦΩ.txt','w'))

# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
for root,dirs,files in os.walk(tempfolder.encode('UTF-8')):
    for filename in files:
        fullname = os.path.join(tempfolder.encode('UTF-8'), filename)
        print(fullname)
        print(open(fullname,'r').read())

As it says in the code, I just want to be able to display the filenames and open/read the files. Regarding display of the filename, I don't care whether the Unicode characters are rendered correctly for the special cases. I just want to print the filename in a manner that uniquely identifies which file is being processed, and doesn't throw an error for these unusual sorts of filenames.

If you comment out the final line of code, the approach shown here will display all three filenames with no errors. But it won't open the file with miscellaneous Unicode in the name.

Is there a single approach that will reliably display/open all three of these filename variations in Python? I'm hoping there is, and that my limited grasp of Unicode subtleties is what is preventing me from seeing it.

Recommended answer

The following works fine if you save the file in the declared encoding and if you use an IDE or terminal whose encoding supports the characters being displayed. Note that this does not have to be UTF-8. The declaration at the top of the file gives the encoding of the source file only.

# coding: utf8
import os

# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.path.join(os.getcwd(), 'temp')
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
for name in ('simple.txt', 'with a ® symbol.txt', 'with these chars ΣΑΠΦΩ.txt'):
    # the with block closes each file so its contents are flushed to disk
    with open(os.path.join(tempfolder, name), 'w') as f:
        print('file contents', file=f)

# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
# tempfolder is a str (not bytes), so os.walk returns str filenames
for root, dirs, files in os.walk(tempfolder):
    for filename in files:
        fullname = os.path.join(root, filename)
        print(fullname)
        with open(fullname, 'r') as f:
            print(f.read())

Output:

c:\temp\simple.txt
file contents

c:\temp\with a ® symbol.txt
file contents

c:\temp\with these chars ΣΑΠΦΩ.txt
file contents
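The main difference from the question's code is that the folder is passed to os.walk as a str, not as UTF-8 encoded bytes. In Python 3, os.listdir and os.walk return names of the same type as the path they are given. On Windows with Python 3.4, bytes paths are most likely converted through the ANSI code page, which cannot represent every character, and that would explain the Errno 22 on the third file. The sketch below illustrates the str/bytes behaviour; the helper names list_names and read_all are hypothetical, not from the original answer.

```python
import os
import tempfile


def list_names(folder):
    """Return the sorted entries of `folder`; their type follows the argument type."""
    return sorted(os.listdir(folder))


def read_all(folder):
    """Walk `folder` (a str path) and return {filename: contents} for every file."""
    contents = {}
    for root, dirs, files in os.walk(folder):
        for filename in files:
            fullname = os.path.join(root, filename)
            with open(fullname, 'r') as handle:
                contents[filename] = handle.read()
    return contents


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as folder:
        for name in ('simple.txt', 'with these chars ΣΑΠΦΩ.txt'):
            with open(os.path.join(folder, name), 'w') as handle:
                handle.write('file contents\n')
        # str path in, str names out
        print(type(list_names(folder)[0]))
        # bytes path in, bytes names out
        print(type(list_names(os.fsencode(folder))[0]))
        # ascii() keeps the output printable on any terminal
        print(ascii(read_all(folder)))
```

Keeping every path as str from start to finish avoids any manual encode or decode step, so there is no point at which a character can be lost.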

If you use a terminal that does not support encoding the characters used in the filename, you will get a UnicodeEncodeError. Change:

print(fullname)

to:

print(ascii(fullname))

and you will see that the filename was read correctly, but that one or more symbols could not be printed in the terminal encoding:

'C:\\temp\\simple.txt'
file contents

'C:\\temp\\with a \xae symbol.txt'
file contents

'C:\\temp\\with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt'
file contents
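If escaping every non-ASCII character is more than is needed, one possible variation is to escape only the characters that the terminal's encoding cannot represent. The sketch below does this with the standard backslashreplace error handler; the helper name printable is hypothetical and not part of the original answer.

```python
import sys


def printable(text, encoding=None):
    """Return `text` with characters that `encoding` cannot represent shown as escapes."""
    # Fall back to ASCII when the stream reports no encoding (for example when redirected)
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


if __name__ == '__main__':
    name = 'with these chars ΣΑΠΦΩ.txt'
    print(printable(name))           # readable wherever the terminal allows it
    print(printable(name, 'ascii'))  # escapes every non-ASCII character
    print(ascii(name))               # same escapes, plus surrounding quotes
```

Unlike ascii(), this helper does not double the backslashes already present in a Windows path, so the result is meant for display only and should not be parsed back into a filename.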
