json library interprets space characters as "\xa0"

Problem description

When I load a JSON file into Python, there's no problem with encodings as long as the file is treated as a string. However, when I load the file as JSON, either with json.load on the file or json.loads on the string, all space characters come out as "\xa0".

The following code yields normal results: it prints the JSON string without any funky "\xa0" signs.

with open(json_path) as f:
    lines = f.readlines()
    for line in lines:
        print(line)

When I load the file as JSON, the space characters are suddenly interpreted as "\xa0".

import json

with open(json_path) as f:
    data = json.load(f)
    print(data.keys())

This gives the following:

dict_keys(['1.\xa0\lorem\xa0ipsum', '2.\xa0\lorem\xa0ipsum\xa0\lorem\xa0ipsum', '3.\xa0\lorem', '4.\xa0\lorem\xa0ipsum', '5.\xa0\lorem\xa0ipsum'])

Loading the string instead of the file using json.loads gives the same results:

with open(json_path) as f:
    s = f.read()  # read the whole file into a single string

data = json.loads(s)
print(data.keys())

I'm building a PDF parser using Java and pdf-box, parsing the headline structure into my own JSON tree. I've tried converting the JSON file into a HashMap in Java, which works fine, so there doesn't seem to be anything weird about the JSON file itself. Is this a Python-specific problem, and is there an explanation for it?

Recommended answer

Assuming:

  1. Your JSON file is valid and uses UTF-8 as its encoding.

  2. Your JSON file contains keys with non-breaking spaces.

Then the output you get is perfectly correct.

The first piece of code reads and prints strings:

with open(json_path) as f:
    lines = f.readlines()
    for line in lines:
        print(line)

When you print a string, it is output more or less unchanged, and the non-breaking spaces look the same as regular spaces.
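
A minimal sketch of that difference, assuming a hypothetical sample string `s` (not taken from the question's file) that contains one non-breaking space, U+00A0. Printing the string writes the character itself, while its repr shows the escaped form:

```python
# Hypothetical sample string containing a non-breaking space (U+00A0).
s = "1.\xa0lorem"

print(s)        # writes the character as-is; it looks like a regular space
print(repr(s))  # shows the escaped form of the same string

# The escape sequence stands for a single character in the string.
nbsp_count = s.count("\xa0")
length = len(s)
```

Both calls work on the same string; only the way it is displayed differs.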

The second piece of code parses a JSON file, thereby creating a dictionary, and then prints the dictionary keys. For simplicity of explanation, let's assume the dictionary itself is printed (instead of the keys):

with open(json_path) as f:
    data = json.load(f)
    print(data)

Calling print with a dictionary as an argument invokes the dictionary's __str__ method. That method uses its own rules for formatting the output: for example, it encloses the dictionary in braces and shows each string in its quoted, escaped form (its repr).

If you study the output, you might notice that printing a dictionary produces valid Python code for that dictionary (at least for simple contents such as strings and numbers).
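
One way to check this claim is to parse the printed text back into a Python object. The sketch below assumes a hypothetical dictionary `d` with simple string keys and integer values, and uses `ast.literal_eval` from the standard library to evaluate the text as a literal:

```python
import ast

# Hypothetical dictionary with keys that need quoting or escaping.
d = {"it's": 1, "tab\there": 2}

text = str(d)                       # the text that print(d) would write
restored = ast.literal_eval(text)   # parse that text as a Python literal

same_as_repr = text == repr(d)      # for a dict, str() and repr() agree
round_trips = restored == d         # the parsed text equals the original
```

This round trip only holds for contents whose repr is itself a literal, which is the case for strings and numbers.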

In Python string literals, certain characters need to be escaped, and an escape sequence starts with a backslash. A typical example is the newline character:

d = {'line1\nline2': 3}
print(d)

Output:

{'line1\nline2': 3}

Part of the dictionary's __str__ logic is evidently to escape non-breaking spaces as well, since they could otherwise not be visually distinguished from regular spaces (even though escaping them is not strictly necessary). The proper way to escape a non-breaking space in Python is \xa0.
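
The whole effect can be reproduced without a file. This is a sketch under assumptions: `raw` is a hypothetical JSON text whose key contains a non-breaking space, and the final replacement step is only an optional suggestion for cases where regular spaces are wanted, not something the parsing requires:

```python
import json

# Hypothetical JSON text; the key contains a non-breaking space (U+00A0).
raw = '{"1.\u00a0lorem": 1}'
data = json.loads(raw)
key = next(iter(data))

print(data.keys())  # keys are displayed in escaped form
print(key)          # the key itself prints with the real character

# The parsed key still holds the original character, so nothing was altered.
has_nbsp = "\xa0" in key

# Optional: build a copy whose keys use regular spaces instead.
normalized = {k.replace("\xa0", " "): v for k, v in data.items()}
```

The escape appears only when the keys are displayed through the dictionary; the data loaded by json.loads matches what is in the JSON text.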

So everything works as designed. It's a feature, not a bug.
