How to read Excel Unicode characters using Python
Problem description
I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters such as "á" or "é".
My code is unchanged, but I migrated from Eclipse Juno to LiClipse and, at the same time, to a different Python package (2.6, up from 2.5). In principle, a working version of the win32com package is available for the specific package I am using.
When I read the Excel file, my code crashes while extracting cell values and converting them to strings using str(). The console output is the following:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)
More concretely, I perform the following:
Read the Excel file:
from win32com.client import Dispatch  # pywin32 COM client
xlApp = Dispatch("Excel.Application")
excel = xlApp.Workbooks.Open(excel_location)
In an inner loop I extract the value of the cell:
cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)
Finally, if I try to convert cell_value to str, it crashes:
print str(cell_value)
If I open the Excel file and remove the non-ASCII characters, everything works smoothly. I have tried this encode proposal. Every other solution I have found on Google proposes saving the file in a specific format, which I cannot do.
What puzzles me is that the code worked before with the same input Excel file, but the change to LiClipse and Python 2.6 killed everything.
Any idea how I can make progress?
Recommended answer
This is a common problem when working with non-ASCII Unicode data in Python 2.x. The handling of this changed in a few places between 2.4 and 2.7, so it is no surprise that you suddenly get an error.
The source of the error is the implicit conversion to a byte string: in Python 2.x, str() does not try to guess which encoding you want, and print does the same for Unicode text when sys.stdout.encoding is not set. Both play it safe and assume that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).
Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding. The text in the cell, on the other hand, is Unicode: the u'\xe1' in the error message shows that a unicode object is being encoded.
Combining the two is a recipe for trouble. To build the byte string, Python has to encode the Unicode text and suddenly finds a character outside the ASCII range (\xe1 is how Python writes the code point U+00E1, the letter "á"; check Wikipedia for the gory details).
That's when the ascii encoder says: Sorry, can't help you there.
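The failure can be reproduced without Excel, which may help confirm where it comes from. The following is a minimal sketch, assuming the cell text behaves like an ordinary Unicode string; `text` and `ascii_encode_error` are hypothetical names chosen for illustration and are not part of win32com.

```python
# Reproduce the UnicodeEncodeError from the question without Excel.
# `text` is a hypothetical stand-in for a cell value containing U+00E1.
text = u"caf\xe1"


def ascii_encode_error(value):
    """Return the error raised by an ASCII encode, or None if it succeeds."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


err = ascii_encode_error(text)
print(type(err).__name__)           # name of the exception class
print(err.encoding)                 # codec that rejected the character
print(err.start)                    # index of the first offending character
print(ascii_encode_error(u"cafe"))  # pure ASCII text encodes without error
```

The exception object carries the codec name and the position of the offending character, which is the same information shown in the console output of the question.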
That means you can work with this value, compare it and such, as long as you keep it as Unicode, but you can't push it through str(). A simple fix for the printing problem is:
s = unicode(cell_value.Value) # Keep the cell text as unicode; str() would force an ascii encode
print repr(s) # repr() escapes non-ASCII characters, so the output is plain ascii
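If the repr() quoting is unwanted, the same escaping can be done explicitly. This is a sketch under the assumption that the value is already a Unicode string; `ascii_safe` is a hypothetical helper name, while `backslashreplace` is one of the standard codec error handlers.

```python
def ascii_safe(text):
    """Escape every non-ASCII character so the result contains only ASCII."""
    # "backslashreplace" swaps each unencodable character for an escape
    # sequence instead of raising UnicodeEncodeError.
    return text.encode("ascii", "backslashreplace").decode("ascii")


print(ascii_safe(u"caf\xe1"))
```

The escaped form is only meant for logging or debugging; the original Unicode value should be kept for any real processing.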
If your terminal supports UTF-8, then you need to tell Python about it:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
You should also have a look at sys.stdout.encoding, which tells you what Python currently thinks the output encoding is. When Python 2 is properly configured (as on modern Linux distributions), the correct codec for output should be used automatically.
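Since sys.stdout.encoding can be unset (for example when output is piped), one option is to pick the encoding defensively. The following is a rough sketch rather than a definitive recipe; `output_encoding`, `encode_for` and `FakeStream` are hypothetical names, and the UTF-8 fallback is an assumption that may not suit every console.

```python
import sys


def output_encoding(stream, default="utf-8"):
    """Return the encoding a stream reports, or a fallback if it has none."""
    return getattr(stream, "encoding", None) or default


def encode_for(stream, text):
    """Encode text for a stream, replacing characters its codec cannot represent."""
    return text.encode(output_encoding(stream), "replace")


class FakeStream(object):
    """Hypothetical stand-in for a stream that reports no encoding."""
    encoding = None


print(output_encoding(sys.stdout))
print(encode_for(FakeStream(), u"caf\xe1") == b"caf\xc3\xa1")
```

With the "replace" error handler, a stream that only supports ascii receives a "?" in place of each unsupported character instead of raising an error.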
Related:
- Python 2 Unicode howto
- Pragmatic Unicode, or, How do I stop the pain?
- Setting the correct encoding when piping stdout in Python