Flattening Generic JSON List of Dicts or Lists in Python


Problem description


I have a set of arbitrary JSON data that has been parsed in Python to lists of dicts and lists of varying depth. I need to be able to 'flatten' this into a list of dicts. Example below:

Source Data Example 1

[{u'industry': [
   {u'id': u'112', u'name': u'A'},
   {u'id': u'132', u'name': u'B'},
   {u'id': u'110', u'name': u'C'},
   ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]

Desired Result Example 1

[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]

This is easy enough for this simple example, but I don't always have this exact structure of a list of dicts with one additional layer of list of dicts. In some cases, I may have additional nesting that needs to follow the same methodology. As a result, I think I will need recursion, and I cannot seem to get this to work.

Proposed Methodology

1) For Each List of Dicts, prepend each key with a 'path' that provides the name of the parent key. In the example above, 'industry' was the key which contained a list of dicts, so each of the children dicts in the list have 'industry' added to them.

2) Add 'Parent' Items to Each Dict within List - in this case, the 'name' and 'industry' were the items in the top level list of dicts, and so the 'name' key/value was added to each of the items in 'industry'.

I can imagine some scenarios where you had multiple lists of dicts or even dicts of dicts in the 'Parent' items and applying each of these sub-trees to the children list of dicts would not work. As a result, I'll assume that the 'parent' items are always simple key/value pairs.
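For the single-level case in Example 1, the two steps above could be sketched roughly as follows. This is only an illustration under the stated assumption that the parent items are simple key/value pairs; the function name `flatten_one_level` and the `list_key` parameter are made up for this sketch and do not handle deeper nesting.

```python
def flatten_one_level(record, list_key):
    """Flatten one record whose `list_key` holds a dict or a list of dicts.

    Hypothetical helper for illustration only: it assumes every other
    value in `record` is a simple key/value pair.
    """
    # Step 2 preparation: collect the simple 'parent' items
    parent = {k: v for k, v in record.items() if k != list_key}

    # A single child dict is treated like a one-element list
    children = record[list_key]
    if isinstance(children, dict):
        children = [children]

    rows = []
    for child in children:
        # Step 1: prepend the parent key name to each child key
        row = {'{}_{}'.format(list_key, k): v for k, v in child.items()}
        # Step 2: add the parent items to each child row
        row.update(parent)
        rows.append(row)
    return rows


record = {'industry': [{'id': '112', 'name': 'A'},
                       {'id': '132', 'name': 'B'}],
          'name': 'materials'}
for row in flatten_one_level(record, 'industry'):
    print(row)
```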

One more example to try to illustrate the potential variabilities in data structure that need to be handled.

Source Data Example 2

[{u'industry': [
   {u'id': u'112', u'name': u'A'},
   {u'id': u'132', u'name': u'B'},
   {u'id': u'110', u'name': u'C', u'company': [
                            {u'id':'500', u'symbol':'X'},
                            {u'id':'502', u'symbol':'Y'},
                            {u'id':'504', u'symbol':'Z'},
                  ]
   },
   ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]

Desired Result Example 2

[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'500', u'company_symbol':'X'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'502', u'company_symbol':'Y'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'504', u'company_symbol':'Z'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]

I have looked at several other examples and I can't seem to find one that works in these example cases.

Any suggestions or pointers? I've spent some time trying to build a recursive function to handle this with no luck after many hours...

UPDATED WITH ONE FAILED ATTEMPT

def _flatten(sub_tree, flattened=[], path="", parent_dict={}, child_dict={}):
    if type(sub_tree) is list:
        for i in sub_tree:
            flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=path,
                                      parent_dict=parent_dict,
                                      child_dict=child_dict
                                      )
                            )
        return flattened
    elif type(sub_tree) is dict:
        lists = {}
        new_parent_dict = {}
        new_child_dict = {}
        for key, value in sub_tree.items():
            new_path = path + '_' + key
            if type(value) is dict:
                for key2, value2 in value.items():
                    new_path2 = new_path + '_' + key2
                    new_parent_dict[new_path2] = value2
            elif type(value) is unicode:
                new_parent_dict[key] = value
            elif type(value) is list:
                lists[new_path] = value
        new_parent_dict.update(parent_dict)
        for key, value in lists.items():
            for i in value:
                flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=key,
                                      parent_dict=new_parent_dict,
                                      )
            )
        return flattened

The result I get is a 231x231 matrix of 'None' objects - clearly I'm getting into trouble with the recursion running away.

I've tried a few additional 'start from scratch' attempts and failed with a similar failure mode.

Solution

Alright. My solution comes with two functions. The first, splitObj, takes care of splitting an object into the flat data and the sublist or subobject which will later require the recursion. The second, flatten, actually iterates over a list of objects, makes the recursive calls, and takes care of reconstructing the final object for each iteration.

def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)

        # just return fully flat objects
        if key is None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub

Note that this does not exactly produce your desired output. The reason for this is that your output is actually inconsistent. In the second example, for the case where there are companies nested in the industries, the nesting isn’t visible in the output. So instead, my output will generate industry_company_id and industry_company_symbol:

>>> from pprint import pprint
>>> ex1 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'id': u'110', u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> ex2 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'company': [{u'id': '500', u'symbol': 'X'},
                                        {u'id': '502', u'symbol': 'Y'},
                                        {u'id': '504', u'symbol': 'Z'}],
                           u'id': u'110',
                           u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]

>>> pprint(list(flatten(ex1)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_id': u'110', 'industry_name': u'C', u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex2)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_company_id': '500',
  'industry_company_symbol': 'X',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '502',
  'industry_company_symbol': 'Y',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '504',
  'industry_company_symbol': 'Z',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
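If the shorter key names from the question are preferred (company_id rather than industry_company_id), one possible variation is to pass only the immediate parent key down as the prefix instead of the accumulated path. The sketch below is not part of the original answer; `split_obj_short` and `flatten_short` are hypothetical names, and, like splitObj above, it only recurses into the first nested list or dict it finds in each object.

```python
def split_obj_short(obj, prefix=None):
    """Split obj into (flat part, key of the first nested value, sub-objects).

    Hypothetical variant for illustration: the returned key is NOT prefixed,
    so nested keys carry only their immediate parent's name.
    """
    flat = {}
    sub_key, subs = None, None
    for k, v in obj.items():
        if sub_key is None and isinstance(v, (list, dict)):
            # remember the first nested value; wrap a single dict in a list
            sub_key = k
            subs = v if isinstance(v, list) else [v]
        else:
            # simple values (and any further nested values) are kept as-is,
            # with the prefix applied when one was given
            name = k if prefix is None else '{}_{}'.format(prefix, k)
            flat[name] = v
    return flat, sub_key, subs


def flatten_short(data, prefix=None):
    """Yield flat dicts, prefixing each key with its immediate parent key only."""
    for item in data:
        flat, key, subs = split_obj_short(item, prefix)
        if key is None:
            yield flat
            continue
        for sub in flatten_short(subs, key):
            sub.update(flat)
            yield sub


sample = [{'industry': [{'id': '110', 'name': 'C',
                         'company': [{'id': '500', 'symbol': 'X'}]}],
           'name': 'materials'}]
for row in flatten_short(sample):
    print(sorted(row.items()))
```

The trade-off is that short prefixes can collide: if two different branches both contain a key called company, their flattened keys become indistinguishable, which the full-path prefix of the original answer avoids.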
