Processing non-English text
Question
I have a Python script that reads a file supplied by the user, processes it, and asks questions in flash-card format. The program works fine with an English .txt file, but I get errors when I try to process a French file.
When I first hit the error, I was running python cards.py in the Windows command prompt. As soon as I supplied the French file, I got a UnicodeEncodeError. After digging around, I found that it might have something to do with using the cmd window, so I tried IDLE instead. I didn't get any errors there, but I did get weird characters such as œ, Ã, and ®.
Upon further research, I found documentation saying to pass encoding='insert encoding type' in the open(file) call in my code. When I ran the program again in IDLE, this seemed to reduce the problem, but I still got some weird characters. In cmd it no longer broke immediately, but it eventually did when it encountered an unknown character.
My question: what should I implement to ensure the program can handle all of the characters in the file (in any language), and why do IDLE and the command prompt handle the file differently?
I forgot to mention that I ended up using utf-8, which gave the results I described.
Recommended answer
This is a common question. It seems your cmd window is using an encoding that does not cover all of Unicode, so the error occurs when Python translates its output into the encoding cmd is running with. Because Unicode has a much wider character set than that encoding, any character outside it raises an error.
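A minimal sketch of that failure, assuming the console uses a legacy code page such as cp437 (the actual code page depends on the Windows locale, so treat that choice as an assumption). The helpers encodable and safe_for_console are hypothetical names used only for illustration; they are not part of the asker's cards.py.

```python
def encodable(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def safe_for_console(text, encoding):
    """Swap characters the encoding cannot represent for '?' instead of failing."""
    return text.encode(encoding, errors="replace").decode(encoding)


# Assumption: cp437 stands in for the console's code page.
print(encodable("café", "cp437"))    # 'é' exists in cp437
print(encodable("œuvre", "cp437"))   # 'œ' does not, so printing it would fail
print(encodable("œuvre", "utf-8"))   # UTF-8 can encode any Unicode character
print(safe_for_console("œuvre", "cp437"))
```

Replacing unencodable characters loses information, so this only keeps the program from stopping; it does not make the console display the missing characters.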
IDLE is built on top of tkinter's Text widget, which fully supports Python's Unicode strings.
Finally, when you open a file, the open function assumes it is in the platform default encoding (per locale.getpreferredencoding()). So if your file's encoding differs, you should state it explicitly in the encoding keyword argument to open.
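As a rough illustration, the sketch below writes a small French sample as UTF-8, then reads it back once with the matching encoding and once with cp1252. The file name cards_fr.txt, the sample sentence, and the read_text helper are made up for this example, and cp1252 is only assumed to be a typical Western European Windows default.

```python
import locale
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, stating the encoding instead of relying on the default."""
    with open(path, encoding=encoding) as f:
        return f.read()


sample = "Où est le cœur ?"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cards_fr.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    correct = read_text(path, encoding="utf-8")   # matches how the file was written
    garbled = read_text(path, encoding="cp1252")  # wrong guess: no error, wrong text

print("platform default:", locale.getpreferredencoding())
print(ascii(correct))  # ascii() keeps this output printable on any console
print(ascii(garbled))
```

Reading with the wrong encoding does not necessarily raise an error; it can silently produce odd characters such as Ã, similar to what the question describes seeing in IDLE.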