Getting readable text from a large JSON file in Python


Problem description

I have a simple question. In Python, I have a very large string that contains text like: \u0398\u03b5\u03b1\u03c4\u03c1\u03b9\u03ba\u03cc

All I need is to convert it into readable text. I looked through some other posts but didn't find a solution.

Edit: I tried the following, without success:

    rawtext = str(json.dumps(result, indent=2, sort_keys=True))
    puretext = str(rawtext)
    with open("result.txt", "a+", encoding="utf-8-sig") as f:
        f.write(puretext)

This is an example from the result.txt file:

    {
      "date": "\u03a0\u03c1\u03b9\u03bd \u03b1\u03c0\u03cc 3 \u03ce\u03c1\u03b5\u03c2",
      "date_utc": "2020-04-01T16:12:41.903Z",
      "domain": "www.protothema.gr",
      "link": "https://www.protothema.gr/culture/article/991328/menoume-spiti-tzaz-taxidia-apo-to-kedro-politismou-idruma-stauros-niarhos/",
      "position": 1,
      "snippet": "... \u03c4\u03bf \u0398\u03b5\u03b1\u03c4\u03c1\u03b9\u03ba\u03cc \u0391\u03bd\u03b1\u03bb\u03cc\u03b3\u03b9\u03bf \u03c4\u03bf\u03c5 \u039a\u03ad\u03bd\u03c4\u03c1\u03bf\u03c5 \u03a0\u03bf\u03bb\u03b9\u03c4\u03b9\u03c3\u03bc\u03bf\u03cd \u038a\u03b4\u03c1\u03c5\u03bc\u03b1 \u03a3\u03c4\u03b1\u03cd\u03c1\u03bf\u03c2 \u039d\u03b9\u03ac\u03c1\u03c7\u03bf\u03c2! \u03a3\u03c4o \u03c0\u03c1\u03ce\u03c4o \u03b5\u03b2\u03b4\u03bf\u03bc\u03b1\u03b4\u03b9\u03b1\u03af\u03bf \u03c4\u03b6\u03b1\u03b6 \u03c1\u03b1\u03bd\u03c4\u03b5\u03b2\u03bf\u03cd \u03bf \u03ba\u03bf\u03c1\u03c5\u03c6\u03b1\u03af\u03bf\u03c2 \u0388\u03bb\u03bb\u03b7\u03bd\u03b1\u03c2 \u03c0\u03b9\u03b1\u03bd\u03af\u03c3\u03c4\u03b1\u03c2 \u03c4\u03b7\u03c2 jazz,\u00a0...",
      "source": "\u03a0\u03c1\u03ce\u03c4\u03bf \u0398\u0395\u039c\u0391",
      "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGRNpdq5fhYi7be2t7UZ-hh-cQvjJqtsnJhN0ShCL7A6DqqPH9aop33FRGTcyfF2gsaU09SG-P&s",
      "title": "\u00ab\u039c\u03b5\u03bd\u03bf\u03c5\u03bc\u03b5 \u03a3\u03c0\u03af\u03c4\u03b9\u00bb: \u03a4\u03b6\u03b1\u03b6 \u03c4\u03b1\u03be\u03af\u03b4\u03b9\u03b1 \u03b1\u03c0\u03cc \u03c4\u03bf \u039a\u03ad\u03bd\u03c4\u03c1\u03bf \u03a0\u03bf\u03bb\u03b9\u03c4\u03b9\u03c3\u03bc\u03bf\u03cd ..."
    }

Solution

By default, the json module escapes all non-ASCII characters. Pass ensure_ascii=False to keep Unicode characters unescaped:

    >>> import json
    >>> print(json.dumps("""{"date": "Πριν από 3 ώρες"}"""))
    "{\"date\": \"\u03a0\u03c1\u03b9\u03bd \u03b1\u03c0\u03cc 3 \u03ce\u03c1\u03b5\u03c2\"}"
    >>> print(json.dumps("""{"date": "Πριν από 3 ώρες"}""", ensure_ascii=False))
    "{\"date\": \"Πριν από 3 ώρες\"}"

Simply pass the parameter when dumping your data:

    with open("result.txt", "a+", encoding="utf-8-sig") as f:
        json.dump(result, f, indent=2, sort_keys=True, ensure_ascii=False)
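As a minimal, self-contained sketch of this approach: the `result` dict below is made-up sample data modeled on the question's output, and the file goes into a temporary directory, so both are assumptions for illustration only. It also uses plain `utf-8` and mode `"w"` instead of the question's `utf-8-sig` and `"a+"`, simply to keep the round trip easy to check.

```python
import json
import os
import tempfile

# Hypothetical sample data, modeled on the question's result.txt
result = {"date": "Πριν από 3 ώρες", "position": 1}

# Write into a throwaway directory so the sketch leaves no files behind
path = os.path.join(tempfile.mkdtemp(), "result.txt")

# ensure_ascii=False writes the Greek text as-is instead of \uXXXX escapes,
# so the file must be opened with an encoding that can represent it.
with open(path, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, sort_keys=True, ensure_ascii=False)

# Read the file back to inspect what was actually written
with open(path, encoding="utf-8") as f:
    written = f.read()

print(written)
```

The text read back contains the Greek characters directly, and it still parses to the same data with `json.loads`.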

Note that JSON with and without non-ASCII escaping is equivalent as far as the JSON standard is concerned: any JSON parser decodes both forms to the same text.
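To illustrate that equivalence, here is a small sketch that parses both forms and compares the results. It also shows one possible way to recover readable text from JSON that was already written with escapes: load it, then dump it again with `ensure_ascii=False`. The variable names are illustrative and not from the original post.

```python
import json

# The same document, once with \uXXXX escapes and once with literal characters.
# The raw string keeps the backslashes intact so json.loads sees the escapes.
escaped = r'{"date": "\u03a0\u03c1\u03b9\u03bd \u03b1\u03c0\u03cc 3 \u03ce\u03c1\u03b5\u03c2"}'
readable = '{"date": "Πριν από 3 ώρες"}'

# Both forms decode to the same Python object
same = json.loads(escaped) == json.loads(readable)

# Round-tripping already-escaped JSON yields the readable form
converted = json.dumps(json.loads(escaped), ensure_ascii=False)

print(same)
print(converted)
```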
