Python 2.7: Read file with Chinese characters

Problem description

I am trying to analyze data in CSV files whose names contain Chinese characters (e.g. "粗1 25g"). I am using Tkinter to choose the files like so:

selectedFiles = askopenfilenames(filetypes=[("xlsx","*"),("xls","*")]) # Use the Tkinter dialog window to choose files
selectedFiles = master.tk.splitlist(selectedFiles) # Create list from files chosen

I have attempted to convert the filenames to unicode in this way:

selectedFiles = [x.decode("utf-8") for x in selectedFiles]

This only produces the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 0: ordinal not in range(128)

I have also tried converting the filenames when the files are created, with the following:

titles = [x.encode('utf-8') for x in titles]

This only raises the error:

IOError: [Errno 22] invalid mode ('wb') or filename: 'C:\...\\data_division_files\\\xe7\xb2\x971 25g.csv'

I have also tried combinations of the above methods, to no avail. What can I do to allow these files to be read in Python?

(This question, while related, has not solved my problem: Obtain File size with os.path.getsize() in Python 2.7.5, https://stackoverflow.com/questions/17410628/obtain-file-size-with-os-path-getsize-in-python-2-7-5)

Recommended answer

When you call decode on a unicode object, Python 2 first encodes it with sys.getdefaultencoding() so that it has bytes to decode. That is why you get an error about ASCII even though you never asked for ASCII anywhere.
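The hidden step can be spelled out by hand. The sketch below is an illustration, not the interpreter's actual code path: it assumes the default encoding is ASCII (the usual value of sys.getdefaultencoding() in Python 2.7), and the helper name implicit_encode_fails is hypothetical. It performs explicitly the encode that Python 2 does behind the scenes, and it runs unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def implicit_encode_fails(text, default_encoding="ascii"):
    """Return True if encoding `text` with the default encoding raises.

    This mimics the hidden first step of unicode.decode() in Python 2.
    ASSUMPTION: the default encoding is ASCII, the usual Python 2.7 value.
    """
    try:
        text.encode(default_encoding)
    except UnicodeEncodeError:
        return True
    return False


print(implicit_encode_fails(u"粗1 25g.csv"))  # name with a Chinese character
print(implicit_encode_fails(u"plain.csv"))    # pure ASCII name
```

Any name containing a non-ASCII character fails at this step, before the requested UTF-8 decode is ever attempted.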

So, where are you getting a unicode object from? From askopenfilename. From a quick test, it looks like it always returns unicode values on Windows (presumably by getting the UTF-16 and decoding it), while on POSIX it returns some unicode and some str (I'd guess by leaving alone anything that fits into 7-bit ASCII and decoding anything else with your filesystem encoding). If you had tried printing the repr or the type of the items in selectedFiles, the problem would have been obvious.
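A quick way to do that inspection is sketched below. The helper name describe_paths and the sample list are hypothetical stand-ins; the list is not real dialog output, only an example of mixed types.

```python
# -*- coding: utf-8 -*-

def describe_paths(paths):
    """Return (type name, repr) for each entry so mixed types stand out."""
    return [(type(p).__name__, repr(p)) for p in paths]


# Stand-in for the value returned by the file dialog (hypothetical data).
sample = [u"粗1 25g.csv", b"plain.csv"]
for type_name, shown in describe_paths(sample):
    print("%s %s" % (type_name, shown))
```

On Python 2.7 the type names are unicode and str; on Python 3 they are str and bytes. Either way, the output shows at a glance whether an entry still needs decoding.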

Meanwhile, encode('utf-8') shouldn't cause any UnicodeErrors, but your filesystem encoding on Windows is probably not UTF-8, so it will likely cause IOErrors with errno 2 (trying to open files that don't exist, or to create files in directories that don't exist), errno 22 (trying to open files with illegal file or directory names on Windows), and so on. That looks like exactly what you're seeing. There is really no reason to do it; just pass the pathnames as-is to open and they'll be fine.
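To see why the encoded name no longer matches, here is a small sketch. It assumes, purely for illustration, that the Windows code page is cp936 (GBK), as on a Simplified Chinese system; the actual code page on the asker's machine is not stated in the question.

```python
# -*- coding: utf-8 -*-

name = u"粗1 25g"

# What encode('utf-8') hands to open(): a byte string, not text.
utf8_bytes = name.encode("utf-8")

# ASSUMPTION: the system code page is cp936 (GBK). Interpreting the UTF-8
# bytes under that code page does not give back the original name.
misread = utf8_bytes.decode("cp936", "replace")

print(repr(utf8_bytes))
print(misread == name)
```

The byte string begins with \xe7\xb2\x97, the same bytes that appear in the IOError message above, and the misread name differs from the original.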

So, basically, if you removed all of your encode and decode calls, your code would probably just work.
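A minimal sketch of that approach follows, assuming a filesystem that accepts the name; the helper round_trip, the temporary directory, and the sample contents are hypothetical. The unicode path goes to open as-is. The encoding argument below applies to the file contents only, not to the filename.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile


def round_trip(directory, filename, text):
    """Write `text` to directory/filename and read it back.

    `filename` is a unicode string passed to open() unchanged, with no
    encode() or decode() call on the path.
    """
    path = os.path.join(directory, filename)
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


workdir = tempfile.mkdtemp()  # throwaway directory for the demonstration
try:
    print(round_trip(workdir, u"粗1 25g.csv", u"a,b\n1,2\n") == u"a,b\n1,2\n")
finally:
    shutil.rmtree(workdir)
```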

However, there's an even easier solution: just use askopenfile or asksaveasfile instead of askopenfilename or asksaveasfilename. Let Tk figure out how to use its pathnames and hand you the file objects, instead of messing with the pathnames yourself.
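A sketch of the askopenfile variant is below. The function name choose_and_read is hypothetical, and binary mode is an assumption made here so that no text decoding is involved. The imports cover both the Python 2.7 module names (Tkinter, tkFileDialog) and the Python 3 ones (tkinter, tkinter.filedialog).

```python
# -*- coding: utf-8 -*-

def choose_and_read():
    """Show an open-file dialog and return the chosen file's raw contents.

    Returns None if the dialog is cancelled.
    """
    try:  # Python 2.7 module names
        import Tkinter as tk
        import tkFileDialog as filedialog
    except ImportError:  # Python 3 module names
        import tkinter as tk
        from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window; only the dialog shows
    try:
        handle = filedialog.askopenfile(mode="rb")
        if handle is None:  # the user pressed Cancel
            return None
        with handle:
            return handle.read()
    finally:
        root.destroy()
```

Calling choose_and_read() requires a display, so it is defined here but not invoked. The path never passes through your own code, which removes the opportunity to encode or decode it incorrectly.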
