Python UTF8 string confusion

Question

Been banging my head on this for a while and I've read a bunch of articles and the issue isn't any clearer. I have a bunch of strings stored in my database, imagine the following:

x = '\xd0\xa4'
y = '\x92'

At the Python shell I get the following:

print x
Ф
print y
?

Which is exactly what I want to see. However then there is the following:

print unicode(x, 'utf8')
Ф

But not this:

unicode(y, 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: unexpected code byte

My feeling is that our strings are getting mangled because Django tries to convert them to unicode, but I'm just guessing at this point. Any insights or workarounds appreciated.

UPDATE: When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding.

Recommended answer

Looks like you have a typo; it should be x = '\xd0\xa4'. It helps a great deal if you copy and paste exactly what you ran and what appeared in the output.

"\x92" is not a valid UTF-8 string. This explains the exception that you got.
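
To see why that byte is rejected, it can help to look at its bit pattern. The following is a minimal sketch written for Python 3 (an assumption; the session in the question is Python 2, where '\x92' is already a byte string). The helper name is_continuation_byte is made up for this illustration.

```python
def is_continuation_byte(b: int) -> bool:
    """Return True if b has the bit pattern 10xxxxxx.

    UTF-8 reserves that pattern for the second and later bytes of a
    multi-byte sequence, so such a byte can never start a character.
    (Hypothetical helper, named for illustration only.)
    """
    return (b & 0b11000000) == 0b10000000


x = b'\xd0\xa4'   # lead byte 0xD0 followed by continuation byte 0xA4
y = b'\x92'       # a continuation byte with no lead byte in front of it

# x is a well-formed two-byte sequence and decodes to U+0424.
decoded_x = x.decode('utf-8')

# y is not well-formed, so decoding raises UnicodeDecodeError.
try:
    y.decode('utf-8')
    y_error = None
except UnicodeDecodeError as exc:
    y_error = exc

print(repr(decoded_x))
print(is_continuation_byte(y[0]))
print(y_error)
```

In other words, 0x92 on its own looks like the tail of a character whose first byte is missing, which is what the exception is complaining about.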

More of a puzzle is why print y produced ?. What are you calling "the Python console"? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?
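
For reference, this is roughly what "replace" mode does, sketched in Python 3 (an assumption; the question uses Python 2). It is only one possible explanation for the "?" and not a diagnosis of the asker's terminal, whose behavior is unknown.

```python
y = b'\x92'

# Decoding with errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER
# for each undecodable byte. Many fonts draw it as a white "?" inside a
# black diamond.
replaced = y.decode('utf-8', errors='replace')

# Encoding that character for an ASCII-only display, again with
# errors='replace', degrades it to a plain question mark.
plain = replaced.encode('ascii', errors='replace')

print(repr(replaced))
print(plain)
```

Either substitution throws away the original byte, so a "?" on screen says nothing about what is actually stored.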

UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""

That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):

IDLE 2.6.4      
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>> 
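
If the column really does hold a mix of UTF-8 and cp1252 bytes, one possible workaround is to try UTF-8 first and fall back to cp1252. This is a sketch under that assumption, written for Python 3; decode_mixed is a hypothetical helper, and cp1252 is only the usual suspect, not a confirmed fact about this database.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode raw as UTF-8, falling back to cp1252 on failure.

    Hypothetical helper: it assumes that anything which is not valid
    UTF-8 was written by a client using cp1252.
    """
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes (for example 0x81) undefined, so
        # errors='replace' keeps the fallback from raising as well.
        return raw.decode('cp1252', errors='replace')


print(repr(decode_mixed(b'\xd0\xa4')))   # valid UTF-8, decoded as such
print(repr(decode_mixed(b'\x92')))       # not UTF-8, decoded as cp1252

# Re-encoding the result gives bytes that are safe to store as UTF-8.
fixed = decode_mixed(b'\x92').encode('utf-8')
print(fixed)
```

This is a heuristic: some cp1252 byte sequences happen to be valid UTF-8 too, so it is worth confirming the real encoding of the data before rewriting anything.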

Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.
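
Something along these lines would do. The sketch uses Python 3 and an in-memory sqlite3 database purely as a stand-in, since the actual database software is not known; the table name notes, the column name body, and the sample value are all invented for the illustration.

```python
import sqlite3

# Stand-in database: the real one, its driver, and its schema are unknown.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)')
conn.execute('INSERT INTO notes (body) VALUES (?)', (b'It\x92s',))

# Fetch the raw bytes and show them unambiguously, without relying on
# any viewer's idea of how to render them.
rows = []
for row_id, body in conn.execute('SELECT id, body FROM notes'):
    raw = bytes(body)
    rows.append((row_id, raw))
    print(row_id, repr(raw), raw.hex())

conn.close()
```

The repr() and hex output show exactly which bytes are stored, which is the information needed to work out which encoding produced them.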
