Character Encoding, XML, Excel, Python

Published: 2020/7/13 5:09:32. Tags: python, excel, encoding, utf-8


Problem description

I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not Windows-1252, because when I try to use that encoding, I wind up with a lot of errors.

The specific word that is causing me trouble right now is "Zmysłowska, Magdalena" (note that the "ł" is not a standard "l"; it has a slash through it).

I have tried a few things; I'll mention three of them here:

(1)

page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")

Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena

(2)

page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)

Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena

Note: this is great, but I need to encode it back to UTF-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with ZmysÅ‚owska, Magdalena again.

(3) Do nothing (no normalization, no decode, no encode). It seems like the string is already UTF-8 when it comes in. However, when I do nothing, the string ends up with the following output again:

Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena

Is there a way for me to convert this string to UTF-8?

Recommended answer

Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8"), both for future-proofing in case you ever move to Python 3 and because it makes the code a bit easier to read, since the encode and decode calls are more obviously parallel. You don't have to, though; the two are equivalent.)

Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which is defaulting to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252 and gets garbage.

There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.

>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena

>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
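For readers on Python 3, the same check can be sketched as follows. This is only an illustration: the variable names are made up here, and the name is hard-coded rather than read from the Excel file. It also shows that the garbling is reversible, which suggests the underlying bytes were fine all along.

```python
# Python 3 sketch: reproduce the garbled output by decoding
# UTF-8 bytes with the wrong codec (Windows-1252).
name = "Zmysłowska, Magdalena"            # hard-coded sample, not read from Excel
utf8_bytes = name.encode("utf-8")         # what the file actually contains
garbled = utf8_bytes.decode("windows-1252")  # what a Windows-1252 console shows

# ascii() keeps the output printable on any console encoding.
print(ascii(utf8_bytes))
print(ascii(garbled))

# Undo the wrong decode to confirm no data was lost.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == name)
```

If the escaped form of garbled matches the repr shown above, the bytes are valid UTF-8 and only the display is wrong.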

There are two ways to fix this:

Print Unicode strings, and let Python take care of it.
Print strings converted to the appropriate encoding.

For option 1:

print page.decode("utf-8") # or unicode(page, "utf-8")

For option 2, it's going to be one of the following:

import sys
print page.decode("utf-8").encode("windows-1252")
print page.decode("utf-8").encode(sys.stdout.encoding) # the console's encoding; sys.getdefaultencoding() is normally 'ascii' on Python 2
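One caveat with the windows-1252 variant above: as far as I can tell, "ł" (U+0142) has no code point in Windows-1252, so a strict encode of this particular name should fail. The Python 3 sketch below illustrates that, along with two possible workarounds. The choice of cp1250 (the Central European Windows code page) is my assumption about what a Polish-capable console might use, not something stated in the question.

```python
# Python 3 sketch: encoding for a console that cannot represent every character.
text = "Zmysłowska, Magdalena"

# A strict encode to Windows-1252 is expected to fail for "ł".
try:
    text.encode("windows-1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Workaround 1: substitute "?" for anything the codec cannot represent.
replaced = text.encode("windows-1252", errors="replace")

# Workaround 2 (assumption): use a code page that contains "ł", such as cp1250.
central_european = text.encode("cp1250")

print(strict_ok)
print(replaced)
print(central_european.decode("cp1250") == text)
```

Either way, this only affects what gets printed; the UTF-8 bytes destined for the database are untouched.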

Of course, if you keep the intermediate Unicode string around, you don't need all those decode calls:

upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")

print upage
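A rough Python 3 equivalent of the snippet above, for comparison. The names raw, text, and stored are placeholders I introduced, and the input bytes are hard-coded to match the output shown in the question. It suggests that the value headed for the database is already correct UTF-8, and that NFKD leaves "ł" alone because that character has no decomposition.

```python
# Python 3 sketch: decode once, normalize, then encode once for storage.
import unicodedata

raw = b"Zmys\xc5\x82owska, Magdalena"     # UTF-8 bytes as they arrive (hard-coded sample)
text = unicodedata.normalize("NFKD", raw.decode("utf-8"))  # work with str in between
stored = text.encode("utf-8")             # bytes to hand to the database

print(ascii(text))
print(stored == raw)
```

The escaped \xc5\x82 sequence in the question's output is simply how those two bytes are displayed, not a sign of corruption.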
