Character Encoding, XML, Excel, Python
Problem Description
I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not Windows-1252, because when I try to use that encoding, I wind up with a lot of errors.
The specific word that is causing me trouble right now is "Zmysłowska, Magdalena" (notice that the "l" is not a standard "l" but has a slash through it).
I have tried a few things; I'll mention three of them here:
(1)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
(2)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Note: this is great, but I need to encode it back to UTF-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with ZmysÅ‚owska, Magdalena again.
(3) Do nothing (no normalization, no decode, no encode). It seems like the string is already UTF-8 when it comes in. However, when I do nothing, the string ends up with the following output again:
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
Is there a way for me to convert this string to UTF-8?
Recommended Answer
Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8"), both for future-proofing in case you ever move to Python 3 and because it makes the code a bit easier to read, since encode and decode are more obviously parallel. You don't have to, though; the two are equivalent.)
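As a quick check that normalization is not what alters the name, here is a minimal Python 3 sketch (the question itself uses Python 2, and the variable names are only illustrative). It relies on the fact that U+0142 has no Unicode decomposition, so NFKD leaves it alone, while a letter such as "é" does get split into a base letter plus a combining accent.

```python
import unicodedata

# Stand-in for the bytes read from the Excel XML file (assumed to be UTF-8).
raw = b"Zmys\xc5\x82owska, Magdalena"

text = raw.decode("utf-8")                        # bytes -> str
normalized = unicodedata.normalize("NFKD", text)  # U+0142 has no decomposition
round_trip = normalized.encode("utf-8")           # str -> bytes

# For comparison: a character that NFKD does decompose.
e_acute = unicodedata.normalize("NFKD", "\u00e9")

print(ascii(normalized))   # the l-with-stroke is still a single code point
print(round_trip == raw)   # same bytes come back out
print(ascii(e_acute))      # base letter followed by a combining accent
```

So the decode, normalize, encode sequence returns exactly the bytes that went in for this name.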
Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which is defaulting to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252 and gets garbage.
There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see whether the resulting Unicode string looks like what you're seeing.
>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena
>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
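The same test can be sketched in Python 3, where bytes and text are separate types. This is only an illustration of the mechanism; the name is hard-coded rather than read from a file, and ascii() is used so the output does not depend on the console's encoding.

```python
name = "Zmys\u0142owska, Magdalena"

# What actually gets written when the text is encoded as UTF-8.
utf8_bytes = name.encode("utf-8")

# What a Windows-1252 consumer makes of those same bytes.
misread = utf8_bytes.decode("windows-1252")

print(ascii(utf8_bytes))
print(ascii(misread))
```

The two bytes 0xC5 0x82 that encode the single character U+0142 are read as two separate Windows-1252 characters, U+00C5 and U+201A, which is the garbage seen on screen.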
There are two ways to fix this:
- Print Unicode strings, and let Python take care of it.
- Print strings converted to the appropriate encoding.
For option 1:
print page.decode("utf-8") # or unicode(page, "utf-8")
For option 2, it's going to be one of the following:
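print page.decode("utf-8").encode("windows-1252")  # if the console really is Windows-1252
print page.decode("utf-8").encode(sys.stdout.encoding)  # the console's reported encoding (needs "import sys"); sys.getdefaultencoding() is "ascii" on Python 2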
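One caveat with encoding for a specific code page: the encode step fails if that code page lacks the character. Windows-1252 has no "ł", whereas Windows-1250 (the Central European code page) does. The Python 3 sketch below illustrates this; encode_for_console is a hypothetical helper, not part of any library, and replacing unencodable characters with "?" is just one possible policy.

```python
def encode_for_console(text, encoding):
    """Encode text for a console, substituting '?' for characters
    that the target code page cannot represent."""
    return text.encode(encoding, errors="replace")

name = "Zmys\u0142owska"

# Strict encoding raises when the code page has no such character.
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

as_cp1250 = encode_for_console(name, "cp1250")
as_cp1252 = encode_for_console(name, "cp1252")

print(strict_ok)
print(as_cp1250)
print(as_cp1252)
```

If the console turns out to use a code page without the character, printing the Unicode string and letting Python handle it (option 1) avoids having to pick the encoding by hand.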
Of course, if you keep the intermediate Unicode string around, you don't need all those decode calls:
upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")
print upage
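For reference, the same pattern in Python 3 might look like the sketch below. The function name prepare and the hard-coded input are assumptions made for illustration: decode once at the input boundary, keep the text value for printing, and encode only when bytes are needed, for example for the database.

```python
import unicodedata

def prepare(raw_bytes):
    """Return (text, utf8_bytes): text for printing, bytes for storage."""
    text = unicodedata.normalize("NFKD", raw_bytes.decode("utf-8"))
    return text, text.encode("utf-8")

text, stored = prepare(b"Zmys\xc5\x82owska, Magdalena")

print(ascii(text))   # print(text) would let Python encode for the console
print(stored)        # UTF-8 bytes to hand to the database layer
```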