Processing non-English text

Question

I have a Python file that reads a file given by the user, processes it, and asks questions in flash card format. The program works fine with an English txt file, but I encounter errors when trying to process a French file.

When I first hit the error, I was running python cards.py in the Windows command prompt. As soon as I gave it the French file, I got a UnicodeEncodeError. After digging around, I found that it might be related to using the cmd window, so I tried IDLE instead. I got no errors, but I saw weird characters such as œ, Ã, and ®.

Upon further research, I found documentation that says to pass encoding='insert encoding type' in the open(file) call. Running the program again in IDLE seemed to reduce the problem, but I still got some weird characters. In cmd, it no longer broke immediately, but it eventually failed when it encountered an unknown character.

My question: what should I implement to ensure the program can handle all of the characters in the file (in any language), and why do IDLE and the command prompt handle the file differently?

I forgot to mention that I ended up using utf-8, which gave the results I described.

Recommended answer

This is a common question. It seems you're using cmd, which doesn't fully support Unicode, so the error occurs when the output is translated to the encoding your cmd session uses. Because Unicode covers a much wider character set than that encoding, any character the encoding cannot represent raises an error.
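As a rough illustration of that failure, the sketch below tries to encode a French string with two codecs. It assumes the console uses a legacy code page such as cp437; the actual code page on a given machine may differ, and the helper name encodable is made up for this example.

```python
# Sample French text; the ligature "œ" is not part of code page 437.
text = "cœur"

def encodable(s, encoding):
    """Return True if every character of s can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# UTF-8 can represent any Unicode character.
print(encodable(text, "utf-8"))

# cp437 is assumed here as a stand-in for the console's encoding.
print(encodable(text, "cp437"))

# With errors="replace", unencodable characters become "?" instead of raising.
fallback = text.encode("cp437", errors="replace")
print(fallback)
```

Printing to a console performs the same encoding step behind the scenes, which is why the error only appears once an unsupported character reaches the output.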

IDLE is built on top of tkinter's Text widget, which fully supports Python's Unicode strings.

Finally, when you open a file, the open function assumes it is in the platform's default encoding (per locale.getpreferredencoding()). If your file's encoding differs, you should state it explicitly in the encoding keyword argument to open.
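A minimal sketch of the difference this makes, assuming the file is stored as UTF-8 and the platform default happens to be cp1252 (both are assumptions for illustration; the temporary file stands in for the user's French text file):

```python
import locale
import os
import tempfile

sample = "île"

# Write a small UTF-8 file to stand in for the French input file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write(sample)
    path = f.name

try:
    # The encoding open() falls back to when none is given.
    print(locale.getpreferredencoding(False))

    # Decoding UTF-8 bytes as cp1252 does not fail, it produces mojibake.
    with open(path, encoding="cp1252") as f:
        wrong = f.read()

    # Naming the file's real encoding recovers the original text.
    with open(path, encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(path)

# ascii() escapes non-ASCII characters, so these prints work on any console.
print(ascii(wrong))
print(ascii(right))
```

Here wrong comes back as "Ã®le", the same kind of stray characters described in the question, while right is the original "île".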
