How to pickle unicodes and save them in utf-8 databases

Problem description




I have a database (mysql) where I want to store pickled data.

The data can be for instance a dictionary, which may contain unicode, e.g.

data = {1 : u'é'}

and the database (mysql) is in utf-8.

When I pickle,

import pickle
pickled_data = pickle.dumps(data)
print type(pickled_data) # returns <type 'str'>

the resulting pickled_data is a string.

When I try to store this in a database (e.g. in a TextField) this can cause problems. In particular, at some point I get a

UnicodeDecodeError "'utf8' codec can't decode byte 0xe9 in position X"

when trying to save the pickled_data in the database. This makes sense because pickled_data can have non-utf-8 characters. My question is how do I store pickled_data on a utf-8 database?

I see two possible candidates:

  1. Encode the result of pickle.dumps to utf-8 and store it. When I want to call pickle.loads, I have to decode it first.

  2. Store the pickled string in binary format (how?), which forces all characters to be within ascii.

My issue is that I don't see what the consequences of choosing one of these options are in the long run. Since the change already requires some effort, I'm asking for opinions on this issue, and for possibly better candidates.

(P.S. This is for instance useful in Django)

Solution

Pickle data is opaque, binary data, even when you use protocol version 0:

>>> pickle.dumps(data, 0)
'(dp0\nI1\nV\xe9\np1\ns.'

When you try to store that in a TextField, Django will try to decode the data as UTF-8 in order to store it. That is the step that fails, because this is not UTF-8 encoded text; it is binary data:

>>> pickled_data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
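
The same behaviour can be checked on Python 3, where pickle.dumps returns bytes rather than str. This is only a minimal sketch under that assumption (the transcript above is Python 2); is_valid_utf8 is a hypothetical helper introduced for illustration, not part of pickle or Django.

```python
import pickle

data = {1: "\u00e9"}  # same value as the question's {1: u'é'}

# Protocol 0 is the text-looking protocol, yet it still emits a raw 0xe9 byte.
pickled_p0 = pickle.dumps(data, protocol=0)
# Protocol 2 and later begin with the PROTO opcode, which is byte 0x80.
pickled_p2 = pickle.dumps(data, protocol=2)


def is_valid_utf8(blob):
    """Return True if blob decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        blob.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(type(pickled_p0).__name__)
print(is_valid_utf8(pickled_p0))
print(is_valid_utf8(pickled_p2))
```

Neither payload is valid UTF-8, so choosing a different pickle protocol does not make the result safe for a text column.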

The solution is to not try to store this in a TextField. Use a BinaryField instead:

"A field to store raw binary data. It only supports bytes assignment. Be aware that this field has limited functionality. For example, it is not possible to filter a queryset on a BinaryField value."

You have a bytes value (Python 2 strings are byte strings, renamed to bytes in Python 3).
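
To illustrate the binary-column approach without a Django project, here is a minimal sketch that uses the standard library's sqlite3 module as a stand-in for the real database. The table and column names are hypothetical, and Python 3 is assumed. In a Django model the payload column would correspond to a models.BinaryField().

```python
import pickle
import sqlite3

data = {1: "\u00e9"}

# In-memory SQLite is only a stand-in for the MySQL database in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stored_objects (id INTEGER PRIMARY KEY, payload BLOB)"
)

# The pickled bytes are passed through untouched: no encode or decode step.
pickled = pickle.dumps(data)
conn.execute(
    "INSERT INTO stored_objects (id, payload) VALUES (?, ?)", (1, pickled)
)

row = conn.execute(
    "SELECT payload FROM stored_objects WHERE id = ?", (1,)
).fetchone()

# Unpickling is acceptable here only because we wrote these bytes ourselves.
restored = pickle.loads(row[0])
conn.close()

print(restored == data)
```

Because the column holds raw bytes, the database's text encoding never comes into play.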

If you insist on storing the data in a text field, explicitly decode it as latin1; the Latin 1 codec maps bytes one-on-one to Unicode codepoints:

>>> pickled_data.decode('latin1')
u'(dp0\nI1\nV\xe9\np1\ns.'

and make sure you encode it again before unpickling again:

>>> encoded = pickled_data.decode('latin1')
>>> pickle.loads(encoded)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Libraries/buildout.python/parts/opt/lib/python2.7/pickle.py", line 1381, in loads
    file = StringIO(str)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
>>> pickle.loads(encoded.encode('latin1'))
{1: u'\xe9'}
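
The Latin-1 round trip can be wrapped in a pair of helpers so that the decode and encode steps are never forgotten. This is a sketch assuming Python 3; pickle_to_text and text_to_object are hypothetical names chosen for illustration.

```python
import pickle


def pickle_to_text(obj):
    """Pickle obj and map each byte to the code point with the same value."""
    return pickle.dumps(obj).decode("latin-1")


def text_to_object(text):
    """Reverse pickle_to_text: recover the original bytes, then unpickle."""
    return pickle.loads(text.encode("latin-1"))


data = {1: "\u00e9"}
as_text = pickle_to_text(data)
round_tripped = text_to_object(as_text)

print(type(as_text).__name__)
print(round_tripped == data)
```

The resulting text can contain control characters such as NUL when a binary pickle protocol is used, so this remains a workaround rather than a replacement for a binary column.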

Do note that if you let this value go to the browser and back again in a text field, the browser is likely to have replaced characters in that data. Internet Explorer will replace \n characters with \r\n, for example, because it assumes it is dealing with text.

Not that you should ever accept pickle data from a network connection in any case; that is a security hole waiting to be exploited.
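
The reason is that a pickle can name any importable callable and have it invoked at load time. The deliberately harmless sketch below (Python 3 assumed; Payload is a hypothetical class) has pickle.loads call sorted instead of rebuilding the object. Hostile data could name a far more dangerous callable in exactly the same way.

```python
import pickle


class Payload(object):
    """Hypothetical class whose pickle asks the loader to call sorted()."""

    def __reduce__(self):
        # (callable, args): pickle.loads will call sorted([3, 1, 2]).
        return (sorted, ([3, 1, 2],))


untrusted_bytes = pickle.dumps(Payload())

# The loader runs the callable chosen by whoever produced the bytes.
result = pickle.loads(untrusted_bytes)

print(result)
print(isinstance(result, Payload))
```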
