Decode UTF-8 encoding in JSON string


Question

I have a JSON file that contains strings encoded as follows:

"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",

I am trying to parse this file using the json module, but I am not able to decode this string correctly.

What I get after decoding the JSON with the .load() method is 'HornÃ\xadkovÃ¡'. The string should be decoded as 'Horníková' instead.

I read the JSON specification and I understand that \u should be followed by 4 hexadecimal digits specifying the Unicode code point of a character. But it seems that in this JSON file the UTF-8 encoded bytes are stored as \u sequences, one escape per byte.
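The suspicion above can be checked directly. The following is only a sketch: it assumes the file content is held in a raw Python string (the variable names are made up for illustration) and compares the code points produced by json.loads() with the UTF-8 bytes of 'í'.

```python
import json

# File content held in a raw string for illustration, so the \u escapes
# reach the JSON parser untouched (hypothetical stand-in for the real file).
raw = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'

name = json.loads(raw)["sender_name"]

# Code points of the two characters that replaced the expected 'í'.
parsed_code_points = [ord(c) for c in name[4:6]]

# Byte values of 'í' when encoded as UTF-8.
utf8_bytes_of_i_acute = list("í".encode("utf-8"))

print(parsed_code_points)     # [195, 173]
print(utf8_bytes_of_i_acute)  # [195, 173]
```

The two lists match, which is consistent with each UTF-8 byte having been written as its own \u00XX escape.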

What type of encoding is this, and how do I parse it correctly in Python 3?

Is this kind of JSON file even valid JSON according to the specification?

Recommended answer

Your text is already UTF-8 encoded, but each encoded byte has been written as a separate \u00XX escape, so the JSON parser turns every byte into its own character. Normally you would tell Python that such data is bytes by using a b prefix on the literal, but since you are using json and its input needs to be a string, you have to decode the encoded text manually. Because your input is a string rather than bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without re-encoding it, and pass the same codec to open() to prevent it from using its own default encoding. Decoding the resulting bytes as UTF-8 then gives the desired result.
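To make the round trip concrete, here is a minimal sketch on a single string. It assumes the mis-decoded value is already available as a Python string; the variable names are illustrative only.

```python
# The value as json returns it: one character per original UTF-8 byte.
mojibake = "Horn\u00c3\u00adkov\u00c3\u00a1"

# 'raw_unicode_escape' writes every character below U+0100 as the single
# byte with the same value, which recovers the original UTF-8 bytes.
as_bytes = mojibake.encode("raw_unicode_escape")

# Decoding those bytes as UTF-8 yields the intended text.
fixed = as_bytes.decode("utf-8")

print(as_bytes)  # b'Horn\xc3\xadkov\xc3\xa1'
print(fixed)     # Horníková
```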

Note that because you need to do the encoding and decoding yourself, you have to read the file content and perform the encoding on the loaded string, so you should use json.loads() instead of json.load().

In [167]: import json

In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())

In [169]: d
Out[169]: {'sender_name': 'Horníková'}
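One caveat with decoding the whole file before parsing: 'raw_unicode_escape' expands every \uXXXX escape, so an escaped quote such as \u0022 inside a string would become a literal quote and break the JSON. A possible alternative, sketched below, is to parse first and repair each string afterwards. The helper fix_mojibake is hypothetical (not part of the json module), and the sketch assumes every string in the file was damaged in the same way, so a legitimate string that happens to look like mis-decoded UTF-8 would also be changed.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were stored one per character.

    Hypothetical helper for illustration; assumes all strings are affected.
    """
    if isinstance(value, str):
        try:
            # latin-1 maps code points 0-255 to the same byte values.
            return value.encode("latin-1").decode("utf-8")
        except UnicodeError:
            # Not representable as bytes, or not valid UTF-8: leave as is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value


# Sample content with an escaped quote, held in a raw string for illustration.
raw = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "quote": "\u0022ok\u0022", "count": 2}'

fixed_doc = fix_mojibake(json.loads(raw))
print(fixed_doc)
```

With a real file, the same helper could be applied to the result of json.load(f), since no pre-processing of the text is needed.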
