Flattening a nested dictionary with unique keys for each dictionary?


Question

I have a dictionary with the following format:

```python
{'7453': 
      {'2H': 
         {'1155': 
            {'in': [{'playerId': 281253}, {'playerId': 169212}], 
            'out': [{'playerId': 449240}, {'playerId': 257943}]},
          '2011': 
            {'in': [{'playerId': 449089}], 
            'out': [{'playerId': 69374}]}, 
          '2568': 
            {'in': [{'playerId': 481900}], 
            'out': [{'playerId': 1735}]}}}, 
    '7454': 
       {'1H': 
          {'2833': 
             {'in': [{'playerId': 56390}], 
             'out': [{'playerId': 208089}]}}, 
        '2H': 
          {'687': 
             {'in': [{'playerId': 574}], 
             'out': [{'playerId': 578855}]}, 
          '1627': 
             {'in': [{'playerId': 477400}], 
             'out': [{'playerId': 56386}]}, 
          '2725': 
             {'in': [{'playerId': 56108}], 
             'out': [{'playerId': 56383}]}}}}
```

I need the data in the following format (in a DataFrame): https://i.stack.imgur.com/GltRb.png

That means I would like to flatten my data into rows such as id: "7453", half: "2H", minute: "1155", type: "in", playerId: "281253". I also need one record per player, with each record still carrying all the other data (id, half, etc.).

I have been struggling with this for days and can't seem to find a solution for this particular problem. So far I have been able to solve similar problems with either pd.json_normalize() or flatten_json(), but neither works for me in this case. If anyone could point me in the right direction or write some code that solves this, it would be much appreciated!

FYI: My biggest struggle is that I actually need a header/column for my keys.

Recommended answer

pandas has explode to unwrap lists, but I am not aware of an equivalent method for dictionaries.

As your dictionary is extremely well structured, you can try

In [28]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).reset_index().rename(columns={'level_0': 'teamId', 'level_1': 'matchPeriod', 'level_2': 'eventSec', 'level_3': 'type'})
Out[28]: 
   teamId matchPeriod eventSec type  playerId
0    7453          2H     1155   in    281253
1    7453          2H     1155   in    169212
2    7453          2H     1155  out    449240
3    7453          2H     1155  out    257943
4    7453          2H     2011   in    449089
..    ...         ...      ...  ...       ...
11   7454          2H     1627  out     56386
12   7454          2H     2725   in     56108
13   7454          2H     2725  out     56383
14   7454          1H     2833   in     56390
15   7454          1H     2833  out    208089

Although chaining the Series constructor and stack is extremely ugly, it builds up the DataFrame level by level.
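For comparison, the same flattening can be sketched in plain Python with a nested comprehension over the four dictionary levels. This is only a minimal sketch and not the answer's method: the sample `d` is a shortened copy of the question's data, `records` is a hypothetical name, and the column names are borrowed from the rename call above.

```python
# Shortened copy of the question's dictionary (assumption: same nesting everywhere)
d = {
    "7453": {"2H": {"1155": {"in": [{"playerId": 281253}, {"playerId": 169212}],
                             "out": [{"playerId": 449240}, {"playerId": 257943}]}}},
    "7454": {"1H": {"2833": {"in": [{"playerId": 56390}],
                             "out": [{"playerId": 208089}]}}},
}

# One record per player; every record carries the keys of all enclosing levels
records = [
    {
        "teamId": team,
        "matchPeriod": period,
        "eventSec": sec,
        "type": kind,
        "playerId": player["playerId"],
    }
    for team, periods in d.items()
    for period, events in periods.items()
    for sec, kinds in events.items()
    for kind, players in kinds.items()
    for player in players
]

for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(records)` should then give one row per player with the keys as columns.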

Update: In principle, you can pass a dictionary to the DataFrame and Series constructors:

In [2]: d                                                                                                                                                                                                  
Out[2]: 
{'7453': {'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
    'out': [{'playerId': 449240}, {'playerId': 257943}]},
   '2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
   '2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}},
 '7454': {'1H': {'2833': {'in': [{'playerId': 56390}],
    'out': [{'playerId': 208089}]}},
  '2H': {'687': {'in': [{'playerId': 574}], 'out': [{'playerId': 578855}]},
   '1627': {'in': [{'playerId': 477400}], 'out': [{'playerId': 56386}]},
   '2725': {'in': [{'playerId': 56108}], 'out': [{'playerId': 56383}]}}}}

In [3]: pd.DataFrame(d)                                                                                                                                                                                    
Out[3]: 
                        7453                      7454
2H  {'1155': {'in': [{'pl...  {'687': {'in': [{'pla...
1H                       NaN  {'2833': {'in': [{'pl...

In [4]: pd.Series(d)                                                                                                                                                                                       
Out[4]: 
7453    {'2H': {'1155': {'in'...
7454    {'1H': {'2833': {'in'...
dtype: object

As they are 2-dimensional and 1-dimensional data structures respectively, they expect dictionaries nested 2 levels and 1 level deep respectively. The DataFrame interprets the outer keys (your 'teamId') as columns and the inner keys ('matchPeriod') as the index, and the cell values are the remaining nested dictionaries, such as

In [5]: d['7453']['2H']                                                                                                                                                                                    
Out[5]: 
{'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
  'out': [{'playerId': 449240}, {'playerId': 257943}]},
 '2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
 '2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}

The Series behaves the same way, but with only one level.

In [6]: d['7453']                                                                                                                                                                                          
Out[6]: 
{'2H': {'1155': {'in': [{'playerId': 281253}, {'playerId': 169212}],
   'out': [{'playerId': 449240}, {'playerId': 257943}]},
  '2011': {'in': [{'playerId': 449089}], 'out': [{'playerId': 69374}]},
  '2568': {'in': [{'playerId': 481900}], 'out': [{'playerId': 1735}]}}}

This is your first level. It is a dictionary again, so you can pass it to the Series constructor as well:

In [7]: pd.Series(d['7453'])                                                                                                                                                                               
Out[7]: 
2H    {'1155': {'in': [{'pl...
dtype: object

The apply function lets you call a function on each value of a Series, here the Series constructor itself:

In [8]: pd.Series(d).apply(pd.Series)                                                                                                                                                                      
Out[8]: 
                            2H                        1H
7453  {'1155': {'in': [{'pl...                       NaN
7454  {'687': {'in': [{'pla...  {'2833': {'in': [{'pl...

Now you arrive at the same result as with the DataFrame constructor (only transposed). This is loosely a form of broadcasting: each value of the original Series now becomes its own Series, and its index is used as the column labels. By calling stack, you instead tell pandas to give you a Series again, stacking all the labels into a MultiIndex if needed.

In [9]: pd.Series(d).apply(pd.Series).stack()                                                                                                                                                              
Out[9]: 
7453  2H    {'1155': {'in': [{'pl...
7454  2H    {'687': {'in': [{'pla...
      1H    {'2833': {'in': [{'pl...
dtype: object

Now you again have a Series (with a two-level index) where each value is a dictionary that can, again, be passed to the Series constructor. So if you repeat the apply(pd.Series).stack() chain, you get

In [10]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack()                                                                                                                                    
Out[10]: 
7453  2H  1155    {'in': [{'playerId': ...
          2011    {'in': [{'playerId': ...
          2568    {'in': [{'playerId': ...
7454  2H  687     {'in': [{'playerId': ...
          1627    {'in': [{'playerId': ...
          2725    {'in': [{'playerId': ...
      1H  2833    {'in': [{'playerId': ...
dtype: object

Now you again have a Series (with a three-level index) where each value is a dictionary that can, again, be passed to the Series constructor.

In [11]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack()                                                                                                           
Out[11]: 
7453  2H  1155  in     [{'playerId': 281253}...
                out    [{'playerId': 449240}...
          2011  in       [{'playerId': 449089}]
                out       [{'playerId': 69374}]
          2568  in       [{'playerId': 481900}]
                out        [{'playerId': 1735}]
7454  2H  687   in          [{'playerId': 574}]
                out      [{'playerId': 578855}]
          1627  in       [{'playerId': 477400}]
                out       [{'playerId': 56386}]
          2725  in        [{'playerId': 56108}]
                out       [{'playerId': 56383}]
      1H  2833  in        [{'playerId': 56390}]
                out      [{'playerId': 208089}]
dtype: object
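The repeated unpacking shown so far can be mimicked in plain Python to make the mechanics visible. The sketch below is an illustration only, not how pandas implements it: `unpack_level` is a hypothetical helper, tuples stand in for the MultiIndex labels, and the sample `d` is a shortened copy of the question's data.

```python
def unpack_level(nested):
    """Mimic one apply(pd.Series).stack() step on a dict keyed by tuples.

    Each value must be a dict; its keys are appended to the tuple path,
    much like stack() appends a new level to the MultiIndex.
    """
    flat = {}
    for path, value in nested.items():
        for key, inner in value.items():
            flat[path + (key,)] = inner
    return flat


# Shortened copy of the question's dictionary
d = {"7453": {"2H": {"1155": {"in": [{"playerId": 281253}],
                              "out": [{"playerId": 449240}]}}}}

# Start with one-element paths, the analogue of pd.Series(d)
level0 = {(team,): periods for team, periods in d.items()}

# Three unpacking steps, matching the three apply/stack pairs above
level3 = unpack_level(unpack_level(unpack_level(level0)))

for path, value in level3.items():
    print(path, value)
```

After three steps the values are lists rather than dictionaries, so a fourth call to `unpack_level` would fail, which mirrors why the pandas chain needs a different tool at this point.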

This is a special case, as your values are now no longer dictionaries but lists (most with one element, some with two). For lists (and unfortunately not for dictionaries), pandas has the explode() method, which creates a new row for each list element.

In [13]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode()                                                                                                 
Out[13]: 
7453  2H  1155  in     {'playerId': 281253}
                in     {'playerId': 169212}
                out    {'playerId': 449240}
                out    {'playerId': 257943}
          2011  in     {'playerId': 449089}
                               ...         
7454  2H  1627  out     {'playerId': 56386}
          2725  in      {'playerId': 56108}
                out     {'playerId': 56383}
      1H  2833  in      {'playerId': 56390}
                out    {'playerId': 208089}
dtype: object

This unpacks each list. Now you again have a Series (with a four-level index) where each value is a dictionary that can, again, be passed to the Series constructor.

In [14]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).stack()                                                                        
Out[14]: 
7453  2H  1155  in   playerId    281253
                     playerId    169212
                out  playerId    449240
                     playerId    257943
          2011  in   playerId    449089
                                  ...  
7454  2H  1627  out  playerId     56386
          2725  in   playerId     56108
                out  playerId     56383
      1H  2833  in   playerId     56390
                out  playerId    208089
dtype: int64
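The explode() step can also be tried in isolation on a tiny Series. The following is a minimal sketch with made-up sample values and hypothetical variable names (`s`, `exploded`, `players`); it assumes a pandas version that provides Series.explode.

```python
import pandas as pd

# Made-up miniature of the Series before explode(): each value is a list of dicts
s = pd.Series(
    [[{"playerId": 281253}, {"playerId": 169212}], [{"playerId": 449240}]],
    index=["in", "out"],
)

# One row per list element; the index label is repeated for each element
exploded = s.explode()

# Pull the single value out of each remaining dictionary
players = exploded.apply(lambda record: record["playerId"])

print(exploded)
print(players)
```

Extracting the value with a lambda is a lighter alternative to apply(pd.Series) when every dictionary holds exactly one known key.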

After these five iterations of applying the Series constructor to your dictionary and reshaping the data until you can apply it again, your dictionary is fully unpacked.

To match your desired result, you can turn all levels of the index into columns with reset_index.

In [15]: pd.Series(d).apply(pd.Series).stack().apply(pd.Series).stack().apply(pd.Series).stack().explode().apply(pd.Series).stack().reset_index()                                                          
Out[15]: 
   level_0 level_1 level_2 level_3   level_4       0
0     7453      2H    1155      in  playerId  281253
1     7453      2H    1155      in  playerId  169212
2     7453      2H    1155     out  playerId  449240
3     7453      2H    1155     out  playerId  257943
4     7453      2H    2011      in  playerId  449089
..     ...     ...     ...     ...       ...     ...
11    7454      2H    1627     out  playerId   56386
12    7454      2H    2725      in  playerId   56108
13    7454      2H    2725     out  playerId   56383
14    7454      1H    2833      in  playerId   56390
15    7454      1H    2833     out  playerId  208089

Neither the Series nor the index levels had names. By default, reset_index uses the column number (0) for the values (which should be 'playerId') and level_0 to level_4 for the index levels. One way to set these appropriately is to rename the Series before calling reset_index and to rename the level columns with rename afterwards.
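The naming step described here might look like the following. It is a minimal sketch on a made-up two-row Series with a four-level index (so without the extra 'playerId' level); the variable names are hypothetical, and the column names follow the ones used at the start of the answer.

```python
import pandas as pd

# Made-up stand-in for the unpacked Series: unnamed values, unnamed index levels
index = pd.MultiIndex.from_tuples([
    ("7453", "2H", "1155", "in"),
    ("7453", "2H", "1155", "out"),
])
s = pd.Series([281253, 449240], index=index)

df = (
    s.rename("playerId")   # name the values before they become a column
     .reset_index()        # unnamed levels become level_0 ... level_3
     .rename(columns={
         "level_0": "teamId",
         "level_1": "matchPeriod",
         "level_2": "eventSec",
         "level_3": "type",
     })
)

print(df)
```

Naming the Series first means the value column never appears as 0, so only the index levels need renaming afterwards.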

I hope this helps.
