Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
Problem description
Sample code:
>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps. (And I'd rather not use XML.)
Is there a way to serialize objects into a utf-8 JSON string (instead of \uXXXX)?
This doesn't help:
>>> output = json_string.decode('string-escape')
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"
This works, but if any sub-object is a python-unicode rather than utf-8, it'll dump garbage:
>>> #### ok:
>>> s= json.dumps( "ברי צקלה", ensure_ascii=False)
>>> print json.loads(s)
ברי צקלה
>>> #### NOT ok:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> print d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94',
2: u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'}
>>> s = json.dumps( d, ensure_ascii=False, encoding='utf8')
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
××¨× ×¦×§××
I searched the json.dumps documentation but couldn't find anything useful.
I'll try to sum up the comments and answers by Martijn Pieters:
(edit: 2nd thought after @Sebastian's comment and about a year later)
There might be no built-in solution in json.dumps.
I'll have to convert all strings in the object to Unicode before it's being JSON-ed.
I'll use Mark's function that converts strings recursively in a nested object.
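A converter in that spirit can be sketched as follows. This is my own illustration, not Mark's actual function, and it is written for Python 3, where byte strings have type bytes; the helper name decode_nested is hypothetical:

```python
import json

def decode_nested(obj, encoding="utf8"):
    # Recursively decode any byte strings inside a nested structure to text,
    # so the whole structure is Unicode before it is passed to json.dumps.
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {decode_nested(k, encoding): decode_nested(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_nested(v, encoding) for v in obj]
    return obj

d = {1: "ברי צקלה".encode("utf8"), 2: "ברי צקלה"}
print(json.dumps(decode_nested(d), ensure_ascii=False))
# → {"1": "ברי צקלה", "2": "ברי צקלה"}
```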
The example I gave depends on my computer and IDE environment, and won't run on all machines.
Thanks everyone :)
Recommended answer
Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:
>>> json_string = json.dumps(u"ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print json_string
"ברי צקלה"
If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
In Python 3, the built-in open() is an alias for io.open(). Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
data = json.dumps(u"ברי צקלה", ensure_ascii=False)
# unicode(data) auto-decodes data to unicode if str
json_file.write(unicode(data))
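In Python 3 the same round trip needs no workaround, since the built-in open() takes an encoding argument directly. A minimal sketch (the temporary file path is arbitrary, chosen only for the example):

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.json")  # throwaway path for this sketch

# Write: open() handles the UTF-8 encoding as the file is written.
with open(path, "w", encoding="utf8") as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

# Read back: the file contains readable UTF-8 text, not \u escapes.
with open(path, encoding="utf8") as json_file:
    content = json_file.read()
print(content)  # → "ברי צקלה"
```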
If you are passing in byte strings (type str in Python 2, bytes in Python 3) encoded to UTF-8, make sure to also set the encoding keyword:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
Note that your second sample is not valid Unicode; you gave it UTF-8 bytes as a unicode literal, and that will never work:
>>> s = u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'
>>> print s
××¨× ×¦×§××
>>> print s.encode('latin1').decode('utf8')
ברי צקלה
Only when I encoded that string to Latin-1 (whose Unicode codepoints map one-to-one to bytes) and then decoded it as UTF-8 do you see the expected output. That has nothing to do with JSON; it is entirely because you used the wrong input. The result is called a Mojibake.
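The repair step can be reproduced in Python 3 as well; a small sketch that first manufactures the Mojibake and then undoes it:

```python
good = "ברי צקלה"

# Simulate the bug: UTF-8 bytes mistakenly decoded as Latin-1.
# Every byte maps to some Latin-1 codepoint, so this "succeeds" silently.
mojibake = good.encode("utf8").decode("latin1")

# Undo it: re-encode to Latin-1 to recover the original bytes,
# then decode those bytes correctly as UTF-8.
repaired = mojibake.encode("latin1").decode("utf8")
print(repaired)  # → ברי צקלה
```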
If you got that Unicode value from a string literal, it was decoded using the wrong codec. It could be that your terminal is mis-configured, or that your text editor saved your source code with a different codec than the one you told Python to read the file with. Or you sourced it from a library that applied the wrong codec. This all has nothing to do with the JSON library.