How to get string objects instead of Unicode from JSON?
Question
I'm using Python 2 to parse JSON from ASCII encoded text files.
When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries or update them.
Is it possible to get string objects instead of Unicode ones?
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
Update
This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution today is to use a recent version of Python, i.e. Python 3 or later.
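For comparison, here is a minimal sketch of the same round trip from the question, assuming it is run on Python 3, where str is already the Unicode text type and there is no separate unicode type to convert away from:

```python
import json

# Same round trip as in the question, run on Python 3.
original_list = ['a', 'b']
new_list = json.loads(json.dumps(original_list))

# Every decoded string is a plain `str` on Python 3.
print([type(item).__name__ for item in new_list])
```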
Recommended answer
A solution with object_hook
import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts=False):
    # if this is a unicode string, return its string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [_byteify(item, ignore_dicts=True) for item in data]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data
Example usage:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
How does this work and why would I use it?
Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?
Purely for performance. Mark's answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:
- A copy of the entire decoded structure gets created in memory
- If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth
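To make the two drawbacks concrete, here is a rough sketch of the decode-then-convert pattern described above. It is not Mark Amery's actual code, which is not reproduced in this post; convert_after_decoding and text_type are hypothetical names chosen for illustration.

```python
import json

# type(u"") is `unicode` on Python 2 and `str` on Python 3,
# so this sketch runs on either version.
text_type = type(u"")

def convert_after_decoding(data):
    # Hypothetical helper: walk an already-decoded value and
    # encode every text string it contains to a UTF-8 byte string.
    if isinstance(data, text_type):
        return data.encode("utf-8")
    if isinstance(data, list):
        return [convert_after_decoding(item) for item in data]
    if isinstance(data, dict):
        return {
            convert_after_decoding(key): convert_after_decoding(value)
            for key, value in data.items()
        }
    return data

decoded = json.loads('{"foo": ["bar", {"baz": 7}]}')  # first full structure
converted = convert_after_decoding(decoded)           # second full structure
```

Both decoded and converted are alive at the same time, and the helper makes one Python-level recursive call per nesting level, which is where the memory and recursion-depth concerns come from.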
This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders.
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
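The order in which the decoder hands dicts to object_hook can be observed with a small tracing hook. This is only an illustrative sketch; record_hook and seen are hypothetical names, not part of the answer's code.

```python
import json

seen = []

def record_hook(obj):
    # Record the keys of each dict as the decoder hands it over,
    # then return the dict unchanged.
    seen.append(sorted(obj.keys()))
    return obj

json.loads(
    '{"outer": {"inner": {"leaf": 1}}, "items": [1, 2]}',
    object_hook=record_hook,
)

# Innermost dicts are completed, and therefore passed to the hook, first.
print(seen)  # [['leaf'], ['inner'], ['items', 'outer']]
```

By the time the hook sees an outer dict, every dict nested inside it has already been through the hook once.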
Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.
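One way to see why a hook should not descend into nested dicts is to count dict visits. The sketch below compares a hypothetical hook that walks into nested dicts with one that only counts the dict it was given; the names recursing_hook, flat_hook and visits are made up for this illustration.

```python
import json

visits = {"recursing": 0, "flat": 0}

def recursing_hook(d):
    # Hypothetical hook that walks into every nested dict and list,
    # as a plain recursive converter would.
    def walk(value):
        if isinstance(value, dict):
            visits["recursing"] += 1
            for v in value.values():
                walk(v)
        elif isinstance(value, list):
            for v in value:
                walk(v)
    walk(d)
    return d

def flat_hook(d):
    # Counts only the dict it was handed; nested dicts were
    # already passed to the hook when they were decoded.
    visits["flat"] += 1
    return d

text = '{"a": {"b": {"c": {}}}}'  # four dicts, nested four levels deep
json.loads(text, object_hook=recursing_hook)
json.loads(text, object_hook=flat_hook)

print(visits)  # {'recursing': 10, 'flat': 4}
```

The recursing hook revisits inner dicts every time an enclosing dict is completed, so the work grows quadratically with nesting depth, while the flat hook touches each dict exactly once.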
Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
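The need for that final call can be checked with another small tracing sketch (tracking_hook and calls are hypothetical names): object_hook is only ever invoked for JSON objects, so a top-level string or list never reaches it.

```python
import json

calls = []

def tracking_hook(obj):
    # Record every value the decoder passes to the hook.
    calls.append(obj)
    return obj

top_string = json.loads('"just a string"', object_hook=tracking_hook)
top_list = json.loads('["x", {"k": "v"}]', object_hook=tracking_hook)

# Only the dict inside the list was passed to the hook;
# the top-level string, the list, and "x" were not.
print(calls)  # [{'k': 'v'}]
```

Without the outer _byteify call, strings such as "x" in a top-level list would be returned unconverted.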