Character Encoding, XML, Excel, Python


Question

I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not Windows-1252, because when I try to use that encoding, I wind up with a lot of errors.

The specific word that is causing me trouble right now is: "Zmysłowska, Magdalena" (notice that the "ł" is not a standard "l"; it has a stroke through it).

I have tried a few things; I'll mention three of them here:

(1)

page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")

Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena

(2)

page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)

Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena

Note: this is great, but I need to encode it back to UTF-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with ZmysÅ‚owska, Magdalena again.

(3) Do nothing (no normalization, no decode, no encode). It seems like the string is already UTF-8 when it comes in. However, when I do nothing, the string ends up with the following output again:

Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena

Is there a way for me to convert this string to UTF-8?

Recommended answer

Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8"), both for future-proofing in case you ever move to Python 3 and to make the code a bit easier to read, because the encode and decode calls are then more obviously parallel. You don't have to, though; the two are equivalent.)
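For readers on Python 3, the same round trip can be sketched as follows. This is only an illustration: the sample bytes are hard-coded and the variable names are made up for the example. It also shows that "ł" (U+0142) has no Unicode decomposition, so NFKD leaves it untouched and the bytes come back unchanged.

```python
import unicodedata

# Assumed input: the raw UTF-8 bytes as they arrive from the file
# (hard-coded here for illustration).
page = b"Zmys\xc5\x82owska, Magdalena"

text = page.decode("utf-8")                       # bytes -> str
normalized = unicodedata.normalize("NFKD", text)  # "ł" has no decomposition
encoded = normalized.encode("utf-8")              # str -> bytes

print(ascii(text))      # the decoded text, shown with escapes
print(encoded == page)  # whether the round trip preserved the bytes
```

In other words, the decode, normalize, encode sequence is lossless for this name, which is consistent with the bytes in the question being valid UTF-8 all along.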

Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which is defaulting to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252 and gets garbage.

There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.

>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena

>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
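The same check can be sketched in Python 3, where bytes and text are separate types. The variable names below are illustrative only. Decoding the UTF-8 bytes with the wrong codec reproduces the garbage, and reversing the mistake recovers the original text:

```python
# Python 3 sketch: reproduce the mojibake by decoding UTF-8 bytes
# with the wrong codec, then undo it.
utf8_bytes = "Zmys\u0142owska, Magdalena".encode("utf-8")

garbled = utf8_bytes.decode("windows-1252")  # wrong codec on purpose
print(ascii(garbled))

# Encoding back with the same wrong codec and decoding as UTF-8
# recovers the original text.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == "Zmys\u0142owska, Magdalena")
```

The two bytes of "ł" in UTF-8 (0xC5 0x82) are each read as a separate Windows-1252 character, which is why one letter turns into two.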

There are two ways to fix this:

  1. Print Unicode strings and let Python take care of it.
  2. Print strings converted to the appropriate encoding.

For option 1:

print page.decode("utf-8") # or unicode(page, "utf-8")
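What "let Python take care of it" means can be sketched in Python 3 with an in-memory stream standing in for the terminal. The stream setup below is an assumption made for the demonstration, not part of the original answer: a text stream knows its own encoding and converts text to bytes when you write to it, so the caller only ever passes text.

```python
import io

# Stand-in for a UTF-8 terminal: a text stream that encodes on write.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

print("Zmys\u0142owska, Magdalena", file=out)  # pass text, not bytes
out.flush()

written = raw.getvalue()  # the bytes the stream actually received
print(written)
```

With a real console, sys.stdout plays the role of `out`, and its encoding is whatever the environment reports.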

For option 2, it's going to be one of the following:

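import sys
print page.decode("utf-8").encode("windows-1252", "replace")  # "ł" is not in Windows-1252, so it prints as "?"
print page.decode("utf-8").encode(sys.stdout.encoding, "replace")  # the console's encoding; sys.getdefaultencoding() is usually "ascii" on Python 2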
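One caveat with option 2 is that the target encoding has to contain every character you print. The Python 3 sketch below (variable names are illustrative) shows that Windows-1252 has no "ł", so a strict encode raises an error, while the "replace" error handler or a code page that does contain the letter, such as Windows-1250, succeeds:

```python
text = "Zmys\u0142owska, Magdalena"

# Strict encoding fails because Windows-1252 has no "ł".
try:
    text.encode("windows-1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "replace" substitutes "?" for anything the codec cannot represent.
replaced = text.encode("windows-1252", "replace")

# Windows-1250 (Central European) does contain "ł".
cp1250 = text.encode("windows-1250")

print(strict_ok, replaced, cp1250)
```

This is one reason option 1 is usually the less fragile choice: the stream's own error handling and encoding are applied in one place.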

Of course, if you keep the intermediate Unicode string around, you don't need all those decode calls:

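import unicodedata

upage = page.decode("utf-8")                  # UTF-8 bytes -> unicode
upage = unicodedata.normalize("NFKD", upage)  # normalize while it is still unicode
page = upage.encode("utf-8", "ignore")        # UTF-8 bytes again, e.g. for the database

print upage                                   # print the unicode object, not the bytes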
