What is the most efficient way to extract info from complex JSON files?


Problem description

I am new to Python and am working on extracting certain information from dict files.

I have millions of JSON files that store text data. The files have broadly similar structures, but there are many variations from one file to the next. For each JSON file, I want to extract all of the text strings stored under a particular key and store them as a dict.

json1 and json2 below are simplified examples. What I have been doing is to take a sample from the JSON files, analyze them, write a lot of if-statements with an attempt to include all of the possible variations. However, I find it inefficient and am still not able to include all of the scenarios. I wonder if there's a general way to search and extract the values using the key "text".

json1 = {
    "section": {
        "heading": {"lvl": "A1", "text": "today"},
        "paragraph": [
            {"color": "green", "text": "yesterday"},
            {"color": "purple", "text": "tomorrow"}
        ]
    }
}

json2 = {
    "paragraph": {"text": "everyday", "color": "black"}
}

In other words, I want to get a dict that contains all the text strings stored under the key "text". For json1, I want to get {"json1": "today yesterday tomorrow"}. For json2, I want to get {"json2": "everyday"}.

Any help is greatly appreciated.

Solution

If you don't know anything else, and the structure can be rather arbitrary as you imply, then you have to visit every node and check. This can be achieved in a generic way using recursion. Here is a quick-and-dirty function to achieve it:

In [4]: def extract_text(obj, acc):
   ...:     # Walk nested dicts/lists, appending every "text" value to acc.
   ...:     if isinstance(obj, dict):
   ...:         for k, v in obj.items():
   ...:             if isinstance(v, (dict, list)):
   ...:                 extract_text(v, acc)  # descend into containers
   ...:             elif k == "text":
   ...:                 acc.append(v)
   ...:     elif isinstance(obj, list):
   ...:         for item in obj:
   ...:             extract_text(item, acc)
   ...:

Here is how you would use it:

In [5]: acc1 = []

In [6]: extract_text(json1, acc1)

In [7]: acc1
Out[7]: ['yesterday', 'tomorrow', 'today']

In [8]: acc2 = []

In [9]: extract_text(json2, acc2)

In [10]: acc2
Out[10]: ['everyday']
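
If you would rather not pass an accumulator around, the same traversal can be written as a generator. This is only a sketch of that variation: the name iter_text and its key parameter are made up for illustration and are not part of the function above. On Python 3.7 and later, dicts keep insertion order, so the values come back in the order they appear in the data; the session output above looks like it came from an older interpreter, where dict order was arbitrary.

```python
def iter_text(obj, key="text"):
    """Yield each non-container value stored under `key`, depth first."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if isinstance(v, (dict, list)):
                # Nested container: keep searching inside it.
                yield from iter_text(v, key)
            elif k == key:
                yield v
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_text(item, key)


json1 = {
    "section": {
        "heading": {"lvl": "A1", "text": "today"},
        "paragraph": [
            {"color": "green", "text": "yesterday"},
            {"color": "purple", "text": "tomorrow"},
        ],
    }
}
json2 = {"paragraph": {"text": "everyday", "color": "black"}}

print(list(iter_text(json1)))  # ['today', 'yesterday', 'tomorrow'] on Python 3.7+
print(list(iter_text(json2)))  # ['everyday']
```

A generator lets the caller decide what to do with the values (join them, count them, stop early) without building a list first.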

Note that your question doesn't really have anything to do with JSON, which is a text-based data serialization format. You are already dealing with deserialized data and Python data structures. In any event, if you really want the result you have in your question, you can simply do:

In [11]: {"json1": ",".join(acc1)}
Out[11]: {'json1': 'yesterday,tomorrow,today'}

Or whatever separator you prefer to join on, like a simple space:

In [12]: {"json1": " ".join(acc1)}
Out[12]: {'json1': 'yesterday tomorrow today'}
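
Since the question mentions actual files on disk, here is a rough sketch of how the pieces might fit together: deserialize each file with json.load, collect the "text" values, and key the joined string by file name. The helper names collect_text and text_by_file are hypothetical, and using the file stem as the dict key is an assumption about what you want. This version uses an explicit stack instead of recursion, so very deeply nested data cannot hit Python's recursion limit; the trade-off is that values sitting directly in a dict are collected before values nested deeper in that same dict.

```python
import json
from pathlib import Path


def collect_text(obj, key="text"):
    """Return every non-container value stored under `key` (no recursion)."""
    found = []
    stack = [obj]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            children = []
            for k, v in node.items():
                if isinstance(v, (dict, list)):
                    children.append(v)
                elif k == key:
                    found.append(v)
            # Reverse so the first child is popped (visited) first.
            stack.extend(reversed(children))
        elif isinstance(node, list):
            stack.extend(reversed(node))
    return found


def text_by_file(paths, key="text", sep=" "):
    """Map each file's stem (assumed naming scheme) to its joined text values."""
    result = {}
    for path in map(Path, paths):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)  # JSON text -> Python dicts and lists
        result[path.stem] = sep.join(str(v) for v in collect_text(data, key))
    return result
```

For example, text_by_file(["json1.json", "json2.json"]) would return one dict with an entry per file, assuming those files exist and contain valid JSON.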
