Recursive walk through a JSON file extracting SELECTED strings


Problem description

I need to recursively walk through JSON files (POST responses from an API), extracting the strings that have ["text"] as a key, e.g. {"text":"this is a string"}.

I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move to the second-oldest source, and so on. The JSON file may be badly nested, and the level at which the strings sit can change from time to time.

Problem: There are many keys called ["text"] and I don't need all of them; I need ONLY the ones whose values are strings. More precisely, the "text":"string" pairs I need are ALWAYS in the same object {} as a "type":"sentence" pair. See image.

What I am asking

Modify the 2nd code below so that it recursively walks the file and extracts ONLY the ["text"] values that are in the same object {} as "type":"sentence".

Below is a snippet of the JSON file (in green, the text and metadata I need; in red, the text I don't need to extract):

Link to full JSON sample: http://pastebin.com/0NS5BiDk

What I have done so far:

1) The easy way: convert the JSON response to a string and search for content between double quotes ("), because in all the JSON POST responses the strings I need are the only ones that appear between double quotes. However, this option prevents me from ordering the resources beforehand, so it is not good enough.

import re

# s, url2 and payload1 are defined earlier (not shown)
r1 = s.post(url2, data=payload1)
j = str(r1.json())

sentences_list = (re.findall(r'\"(.+?)\"', j))

numentries = 0
for sentences in sentences_list:
    numentries += 1
    print(sentences)
    print(numentries)

2) The smarter way: recursively walk through the JSON file and extract the ["text"] values

def get_all(myjson, key):
    if type(myjson) is dict:
        for jsonkey in (myjson):
            if type(myjson[jsonkey]) in (list, dict):
                get_all(myjson[jsonkey], key)
            elif jsonkey == key:
                print (myjson[jsonkey])
    elif type(myjson) is list:
        for item in myjson:
            if type(item) in (list, dict):
                get_all(item, key)

print(get_all(r1.json(), "text"))

It extracts all the values that have ["text"] as a key. Unfortunately, the file contains other items (that I don't need) that also have ["text"] as a key, so it returns text that I don't need.

Please advise.

UPDATE

I have written two pieces of code to sort the list of objects by a certain key. The first sorts by the 'text' of the XML. The second sorts by the 'Comprising period from' value.

The first one works, but a few of the XMLs, even though they have higher numbers, actually contain documents that are older than I expected.

For the second one, the format of 'Comprising period from' is not consistent, and sometimes the value is not present at all. The second one also gives me an error that I cannot figure out: string indices must be integers.

# 1st code (it works but not ideal)

j=r1.json()

list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

newlist = sorted(list, key=lambda k: k['text'][-9:])
print(newlist)

# 2nd code I need something to expect missing values and to solve the
# list index error
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

def date(key):
    return dparser.parse((' '.join(key.split(' ')[-3:])),fuzzy=True)

def order(list_to_order):
    try:
        return sorted(list_to_order,
                      key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(list))

Recommended answer

I think this will do what you want, as far as selecting the right strings goes. I also changed the type checking to use isinstance(), which is considered a better way to do it because it supports object-oriented polymorphism.
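As a brief aside on the isinstance() point, here is a minimal sketch of the difference. It assumes the JSON might be loaded into a dict subclass, for example via json.load(f, object_pairs_hook=OrderedDict); the variable names are only illustrative.

```python
from collections import OrderedDict

# An OrderedDict is a subclass of dict, as produced by
# json.load(f, object_pairs_hook=OrderedDict).
od = OrderedDict(text="hello", type="sentence")

exact = type(od) is dict          # exact-type check: ignores subclasses
flexible = isinstance(od, dict)   # subclass-aware check

print(exact)     # False
print(flexible)  # True
```

With the exact-type check, a walker would silently skip such objects; with isinstance() they are treated like any other dictionary.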

import json
_NUL = object()  # unique value guaranteed to never be in JSON data

def get_all(myjson, kind, key):
    """ Recursively find all the values of key in all the dictionaries in myjson
        with a "type" key equal to kind.
    """
    if isinstance(myjson, dict):
        key_value = myjson.get(key, _NUL)  # _NUL if key not present
        if key_value is not _NUL and myjson.get("type") == kind:
            yield key_value
        for jsonkey in myjson:
            jsonvalue = myjson[jsonkey]
            for v in get_all(jsonvalue, kind, key):  # recursive
                yield v
    elif isinstance(myjson, list):
        for item in myjson:
            for v in get_all(item, kind, key):  # recursive
                yield v    

with open('json_sample.txt', 'r') as f:
    data = json.load(f)

numentries = 0
for text in get_all(data, "sentence", "text"):
    print(text)
    numentries += 1

print('\nNumber of "text" entries found: {}'.format(numentries))
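
Regarding the sorting code in the update: in the expression k[date(["metadata"][0]["value"])], ["metadata"] is a list literal, so ["metadata"][0] is the string "metadata", and indexing that string with "value" raises "string indices must be integers". That is a TypeError, so the except ValueError clause does not catch it. The following is only a sketch of one way to sort oldest first while tolerating missing or oddly formatted dates. It assumes each source keeps its date text in row["metadata"][0]["value"] (as the question's code suggests), and the formats in DATE_FORMATS are hypothetical placeholders to be adapted to the real data. It uses datetime.strptime from the standard library rather than the dparser used in the question.

```python
from datetime import datetime

# Hypothetical date layouts; extend this tuple to match the real data.
DATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%d/%m/%Y")

def parse_date(value):
    """Return a datetime parsed from the end of value, or None if that fails."""
    if not isinstance(value, str):
        return None
    words = value.split()
    # Try the last three words first ("5 March 2012"), then the last word alone.
    candidates = (" ".join(words[-3:]), words[-1] if words else "")
    for candidate in candidates:
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(candidate, fmt)
            except ValueError:
                continue
    return None

def source_date(row):
    """Date of one source, read from row["metadata"][0]["value"] (assumed layout)."""
    try:
        return parse_date(row["metadata"][0]["value"])
    except (KeyError, IndexError, TypeError):
        return None  # metadata missing, empty, or not shaped as expected

def order(rows):
    """Sort oldest first; rows without a usable date are placed last."""
    def sort_key(row):
        found = source_date(row)
        return (found is None, found or datetime.max)
    return sorted(rows, key=sort_key)

# Made-up sample rows, only to illustrate the behaviour.
rows = [
    {"text": "doc 3", "metadata": [{"value": "Comprising period from 5 March 2012"}]},
    {"text": "doc 1", "metadata": []},
    {"text": "doc 2", "metadata": [{"value": "Comprising period from 17/01/2009"}]},
]

for row in order(rows):
    print(row["text"], source_date(row))
```

Each ordered source could then be passed to get_all(row, "sentence", "text") so that the sentences come out starting with the oldest source.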
