Python UTF8 string confusion

Views: 144 Published: 2017/5/29 21:02:51 python django unicode

Question

Been banging my head on this for a while and I've read a bunch of articles and the issue isn't any clearer. I have a bunch of strings stored in my database, imagine the following:

x = '\xd0\xa4'
y = '\x92'

At the Python shell I get the following:

print x
Ф
print y
?

Which is exactly what I want to see. However then there is the following:

print unicode(x, 'utf8')
Ф

But not this:

unicode(y, 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: unexpected code byte

My feeling is that our strings are getting mangled because Django tries to convert them to unicode, but I'm just guessing at this point. Any insights or workarounds appreciated.

UPDATE: When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding.

Recommended answer

Looks like you have a typo; should be x = '\xd0\xa4'. It helps very much if you use copy paste of what you actually ran and what appeared on the output.

"\x92" is not a valid UTF-8 string. This explains the exception that you got.
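
A quick way to check this, sketched in Python 3 syntax (an assumption: the thread itself uses Python 2, where '\x92' is already a byte string). The byte 0x92 is 10010010 in binary, which is the bit pattern of a UTF-8 continuation byte, so it cannot stand alone or start a character. The helper name is_continuation_byte is made up for this illustration.

```python
# Python 3 sketch; in Python 2 the literals would be '\xd0\xa4' and '\x92'.
good = b'\xd0\xa4'   # two-byte UTF-8 sequence for U+0424
bad = b'\x92'        # a lone byte of the form 10xxxxxx


def is_continuation_byte(value):
    """Return True if the byte value has the bit pattern 10xxxxxx."""
    return (value & 0xC0) == 0x80


# ascii() keeps the output readable on any console encoding.
print(ascii(good.decode('utf-8')))
print(is_continuation_byte(bad[0]))

try:
    bad.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print('cannot decode:', exc.reason)
```

The lead byte 0xd0 announces a two-byte sequence and 0xa4 is its continuation, which is why x decodes and y does not.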

More of a puzzle is why print y produced ?. What are you calling "the Python console"?? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?
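
For comparison, here is a small sketch (Python 3 syntax, which is an assumption since the thread uses Python 2) of the two things "replace" mode can do. Decoding with errors='replace' substitutes U+FFFD, which most fonts draw as a white question mark in a black diamond. Encoding a character that the target charset lacks with errors='replace' yields a plain "?". Which of the two the asker's console did cannot be told from the thread.

```python
import unicodedata

bad = b'\x92'

# Decoding: each undecodable byte becomes U+FFFD.
replaced = bad.decode('utf-8', errors='replace')
print(ascii(replaced), unicodedata.name(replaced))

# Encoding: a character missing from the target charset becomes a plain "?".
plain = '\u2019'.encode('ascii', errors='replace')
print(plain)
```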

UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""

That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):

IDLE 2.6.4      
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>>

Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.
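
A minimal sketch of that kind of inspection, in Python 3 syntax (the thread uses Python 2). Everything named here is hypothetical: bad_string stands in for a value fetched from the database, and CANDIDATE_ENCODINGS and try_decodings are illustrative names, not part of any library.

```python
import unicodedata

# Hypothetical stand-in for a value fetched from the database.
bad_string = b'It\x92s'

# Assumed list of encodings worth trying; adjust to where the data came from.
CANDIDATE_ENCODINGS = ['utf-8', 'cp1252', 'latin-1']


def try_decodings(raw, encodings=CANDIDATE_ENCODINGS):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# repr() shows the raw bytes without any console re-encoding.
print(repr(bad_string))

for name, text in try_decodings(bad_string).items():
    if text is None:
        print(name, 'cannot decode')
    else:
        # Name only the non-ASCII characters; '?' marks unnamed ones.
        names = [unicodedata.name(c, '?') for c in text if ord(c) > 127]
        print(name, ascii(text), names)
```

Note that latin-1 never fails, because it maps every byte to a code point, so a successful decode there proves nothing on its own. The character names are what show whether a guess is plausible.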

Also read this and this.
