Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence


Problem description

Sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I'd rather not use XML).

Is there a way to serialize objects into a UTF-8 JSON string (instead of \uXXXX)?

This doesn't help:

>>> output = json_string.decode('string-escape')
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

This works, but if any sub-object is a Python unicode object rather than a UTF-8 byte string, it dumps garbage:

>>> #### ok:
>>> s= json.dumps( "ברי צקלה", ensure_ascii=False)    
>>> print json.loads(s)   
ברי צקלה

>>> #### NOT ok:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> print d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 
 2: u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'}
>>> s = json.dumps( d, ensure_ascii=False, encoding='utf8')
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
××¨× ×¦×§××

I searched the json.dumps documentation but couldn't find anything useful.

I'll try to sum up the comments and answers by Martijn Pieters:

(Edit: second thoughts after @Sebastian's comment, about a year later.)


  1. There is a built-in solution in json.dumps after all (I first thought there might be none).

  2. I'll have to convert all strings to Unicode objects before the object is JSON-ed. I'll use Mark's function, which converts strings recursively in a nested object.

  3. The example I gave depends on my computer and IDE environment, and does not behave the same way on all computers.
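Mark's function is not reproduced in this thread, so the following is only a hypothetical sketch of what such a recursive conversion could look like. It assumes Python 3, where byte strings are `bytes` and text is `str`; the name `to_text` and the sample data are illustrative, not taken from the original.

```python
import json

def to_text(obj, encoding="utf-8"):
    """Recursively decode byte strings found inside nested containers.

    Hypothetical helper: not the function referred to in the thread.
    """
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        # Keys may be byte strings too, so convert both keys and values.
        return {to_text(k, encoding): to_text(v, encoding) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples become lists; JSON only has arrays anyway.
        return [to_text(item, encoding) for item in obj]
    return obj

mixed = {"a": "ברי צקלה".encode("utf-8"), "b": ["ברי צקלה", b"abc"]}
clean = to_text(mixed)
dumped = json.dumps(clean, ensure_ascii=False)
```

After the conversion every string in `clean` is text, so `json.dumps` never sees a mix of encoded and decoded values.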

Thank you, everybody :)

Recommended answer

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps(u"ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print json_string
"ברי צקלה"
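For readers on Python 3, where `str` is already Unicode, the same idea might be sketched as follows. This assumes Python 3 and uses illustrative variable names; the explicit `.encode()` step is only needed when bytes are actually required.

```python
import json

text = "ברי צקלה"

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(text)

# ensure_ascii=False keeps the characters as they are in the resulting str.
readable = json.dumps(text, ensure_ascii=False)

# Encode explicitly only when bytes are needed (e.g. a socket or a binary file).
utf8_bytes = readable.encode("utf-8")

print(escaped)  # ASCII-only output, safe on any terminal
```

Both forms are valid JSON and load back to the same value; only the on-disk readability differs.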

If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead of json.dumps() to write to that file:

import io
import json

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
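A Python 3 variant of the file-writing approach could look like the sketch below. It assumes Python 3; the temporary path and the sample dictionary are hypothetical and only there to make the example self-contained.

```python
import json
import os
import tempfile

data = {"name": "ברי צקלה", "count": 2}

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "data.json")

# Text mode with an explicit encoding: the file object encodes str to UTF-8.
with open(path, "w", encoding="utf-8") as json_file:
    json.dump(data, json_file, ensure_ascii=False, indent=2)

# Read back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as json_file:
    loaded = json.load(json_file)

# The raw bytes on disk contain UTF-8 text, not \u escapes.
with open(path, "rb") as raw_file:
    raw = raw_file.read()
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding, which differs between systems.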

In Python 3, the built-in open() is an alias for io.open(). Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 is then:

import io
import json

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

If you are passing in byte strings (type str in Python 2) encoded as UTF-8, make sure to also set the encoding keyword. In Python 3, json.dumps() has no encoding keyword and does not serialize bytes objects, so decode them to str first.

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה
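In Python 3 there is no `encoding` keyword to lean on, so one possible substitute is the `default` hook of `json.dumps`, which is called for objects the encoder cannot serialize. This is a sketch under that assumption, not part of the original answer; `decode_bytes` is an illustrative name, and the hook only covers values, not dictionary keys.

```python
import json

def decode_bytes(obj):
    # json calls this only for objects it cannot serialize itself.
    if isinstance(obj, bytes):
        return obj.decode("utf-8")
    raise TypeError("Object of type %s is not JSON serializable" % type(obj).__name__)

# One value is UTF-8 bytes, the other is already text.
d = {"1": "ברי צקלה".encode("utf-8"), "2": "ברי צקלה"}

s = json.dumps(d, ensure_ascii=False, default=decode_bytes)
result = json.loads(s)
```

Both entries come back as the same text, mirroring the Python 2 session above.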

Note that your second sample is not valid Unicode; you gave it UTF-8 bytes as a unicode literal, which would never work:

>>> s = u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'
>>> print s
××¨× ×¦×§××
>>> print s.encode('latin1').decode('utf8')
ברי צקלה

Only when I encode that string to Latin-1 (whose Unicode code points map one-to-one to bytes) and then decode it as UTF-8 do you see the expected output. That has nothing to do with JSON and everything to do with using the wrong input. The result is called mojibake.
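The round trip described above can be reproduced in Python 3 as a small sketch. It assumes the garbling came from UTF-8 bytes being read as Latin-1, which is the case discussed here; other mis-decodings would need a different codec pair.

```python
original = "ברי צקלה"

# Simulate the mistake: UTF-8 bytes interpreted as Latin-1 text.
garbled = original.encode("utf-8").decode("latin-1")

# Undo it: Latin-1 maps code points 0-255 straight back to the same bytes,
# which can then be decoded with the codec that should have been used.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The garbled string has one character per original byte, which is why it looks longer and unreadable.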

If you got that Unicode value from a string literal, it was decoded using the wrong codec. It could be that your terminal is misconfigured, or that your text editor saved your source code using a different codec than the one you told Python to read the file with. Or you sourced it from a library that applied the wrong codec. None of this has anything to do with the JSON library.
