Python Encoding - Could not decode to utf8


Problem description

I have an SQLite database that was populated by an external program, and I'm trying to read the data with Python. When I attempt to read the data I get the following error:

OperationalError: Could not decode to UTF-8

If I open the database in SQLite Manager and look at the data in the offending records using the built-in browse and search, it looks fine. However, if I export the table as CSV, I notice that the character £ in the offending records has become Â£.

If I read the CSV in Python, the £ in the offending records is still read as Â£, but that is not a problem because I can parse it manually. However, I need to be able to read the data directly from the database, without the intermediate step of converting to CSV.

I have looked at some answers online for similar questions. So far I have tried setting "text_factory = str", and I have also tried changing the data type of the column from TEXT to BLOB using SQLite Manager, but I still get the error.

My code below results in the OperationalError: Could not decode to UTF-8

import sqlite3

conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
# In SQLite, "LIMIT 5000, 5001" means: skip 5000 rows, then return up to 5001
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000, 5001''')
row = curr.fetchone()

All the records above 5000 in the database have this character problem and hence produce the error.

Any help is appreciated.

Recommended answer

Python is trying to be helpful by converting pieces of text (stored as bytes in the database) into Python str objects for you. To do this conversion, Python has to guess which character each byte (or group of bytes) returned by your query represents. The default guess is an encoding called UTF-8. Evidently, that guess is wrong in your case.
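To make the byte-to-character guess concrete, here is a minimal sketch. The sample bytes are an assumption (what a Latin-1 program might have written for "£100"), not data taken from the question's database.

```python
# Hypothetical sample: "£100" as written by a program that uses Latin-1.
raw = b"\xa3100"

# Guessing UTF-8 fails, because 0xA3 on its own is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Guessing Latin-1 maps the single byte 0xA3 to the pound sign.
as_latin1 = raw.decode("latin1")

# The opposite mismatch (UTF-8 bytes read as Latin-1) yields two characters.
mojibake = "£".encode("utf-8").decode("latin1")

print(utf8_ok)
print(as_latin1)
print(mojibake)
```

The same bytes give an error, the intended text, or garbage depending only on which encoding is assumed.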

The solution is to give Python a hint about how to map bytes to letters (i.e., Unicode characters). You've already come close with the line

conn.text_factory = str

However (based on your response in the comments above), since you are using Python 3, str is already the default text factory, so that line does nothing new for you (see the docs).

What happens behind the scenes with this line is that Python tries to convert the bytes returned by the query using the str function, kind of like:

your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`

...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function, called a lambda, for this:

conn.text_factory = lambda x: str(x, 'latin1')
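A self-contained way to try this out is sketched below. It is an illustration under assumptions: an in-memory database stands in for test.db, the table and column names are borrowed from the question, and CAST(? AS TEXT) is used only to simulate an external program that stored Latin-1 bytes in a TEXT column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real test.db
conn.execute("CREATE TABLE hands_1 (xml_dump TEXT)")

# CAST(blob AS TEXT) keeps the bytes unchanged but labels them as TEXT,
# which reproduces a TEXT value that is not valid UTF-8.
conn.execute(
    "INSERT INTO hands_1 VALUES (CAST(? AS TEXT))",
    (b"<pot>\xa3100</pot>",),
)

# With the default text factory (str, decoding as UTF-8) the read fails.
try:
    conn.execute("SELECT xml_dump FROM hands_1").fetchone()
    default_failed = False
except sqlite3.OperationalError:
    default_failed = True

# With the lambda, the raw bytes are decoded as Latin-1 instead.
conn.text_factory = lambda x: str(x, "latin1")
row = conn.execute("SELECT xml_dump FROM hands_1").fetchone()

print(default_failed)
print(row[0])
conn.close()
```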



Now when the database hands the bytes to Python, Python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know whether latin1 is the correct encoding for your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:


  • 'iso-8859-1'
  • 'utf-16'
  • 'utf-32'
  • 'latin1'

You can find a more complete list here.
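Trying several encodings by hand gets tedious, so a small helper can do it in one pass. This is only a sketch: try_encodings is a hypothetical helper name, and the sample bytes are assumed rather than taken from the real database.

```python
def try_encodings(raw, candidates):
    """Return a dict mapping each encoding name to the decoded text,
    or to None when the bytes cannot be decoded with that encoding."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample bytes; in practice, use one offending value read
# from the database as bytes.
sample = b"\xa3100"
candidates = ["utf-8", "iso-8859-1", "utf-16", "utf-32", "latin1"]

results = try_encodings(sample, candidates)
for name, text in results.items():
    print(name, repr(text))
```

A decode that succeeds is not proof that the encoding is right. In Python, 'latin1' is an alias for 'iso-8859-1', and it accepts every possible byte, so it never raises an error. The decoded text still has to be checked by eye against what the data should say.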

Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea depends on your application. You can do it by setting:

conn.text_factory = bytes
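Keeping the values as bytes lets you decide per value how to decode them, which may help when only some rows are affected. The sketch below assumes that the good rows are valid UTF-8 and the bad rows are Latin-1. decode_with_fallback is a hypothetical helper, and the in-memory table only imitates the question's schema.

```python
import sqlite3


def decode_with_fallback(raw, primary="utf-8", fallback="latin1"):
    """Decode with the primary encoding, falling back if that fails."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        return raw.decode(fallback)


conn = sqlite3.connect(":memory:")  # stand-in for the real test.db
conn.text_factory = bytes           # TEXT values now come back as bytes
conn.execute("CREATE TABLE hands_1 (xml_dump TEXT)")

# One value stored as proper UTF-8 text, one with raw Latin-1 bytes.
conn.execute("INSERT INTO hands_1 VALUES (?)", ("good \u00a3",))
conn.execute("INSERT INTO hands_1 VALUES (CAST(? AS TEXT))", (b"bad \xa3",))

raw_rows = [r[0] for r in conn.execute(
    "SELECT xml_dump FROM hands_1 ORDER BY rowid")]
decoded = [decode_with_fallback(r) for r in raw_rows]

print(raw_rows)
print(decoded)
conn.close()
```

The fallback order matters. UTF-8 is tried first because it rejects most invalid input, whereas Latin-1 would accept anything.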
