Creating complex nested dictionaries from Pandas DataFrame

Problem description

I'm trying to find a generic way of creating (possibly deeply) nested dictionaries from a flat Pandas DataFrame instance.

Suppose I have the following DataFrame:

import pandas as pd

dat = pd.DataFrame({'name' : ['John', 'John', 'John', 'John', 'Henry', 'Henry'],
                    'age' : [24, 24, 24, 24, 31, 31],
                    'gender' : ['Male','Male','Male','Male','Male','Male'],
                    'study' : ['Mathematics', 'Mathematics', 'Mathematics', 'Philosophy', 'Physics', 'Physics'],
                    'course' : ['Calculus 101', 'Calculus 101', 'Calculus 102', 'Aristotelean Ethics', 'Quantum mechanics', 'Quantum mechanics'],
                    'test' : ['Exam', 'Essay','Exam','Essay', 'Exam1','Exam2'],
                    'pass' : [True, True, True, True, True, True],
                    'grade' : ['A', 'A', 'B', 'A', 'C', 'C']})
dat = dat[['name', 'age', 'gender', 'study', 'course', 'test', 'grade', 'pass']] #re-order columns to better reflect data structure

I want to create a deeply nested dictionary (or list of nested dictionaries), that 'respects' the underlying structure of this data. That is, a grade is information about a test, which is part of a course, which is part of a study, that a person does. Also, age and gender are information about that same person.

An example desired output is this:

[{'John': {'age': 24,
           'gender': 'Male',
           'study': {'Mathematics': {'Calculus 101': {'Exam': {'grade': 'B',
                                                               'pass': True}}},
                     'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
                                                                      'pass': True}}}}}},
 {'Henry': {'age': 31,
            'gender': 'Male',
            'study': {'Physics': {'Quantum mechanics': {'Exam1': {'grade': 'C',
                                                                  'pass': True},
                                                        'Exam2': {'grade': 'C',
                                                                  'pass': True}}}}}}]

(although there may be other, similar ways to structure such data).

I tried using groupby, which makes it easy to, for example, nest 'grade' and 'pass' under 'test', 'test' under 'course', 'course' under 'study', and 'study' under 'name'. But I don't see how to also add 'gender' and 'age' under 'name'. Something like this is the best I came up with:

dic = {}
for ind, row in dat.groupby(['name', 'study', 'course', 'test'])[['grade', 'pass']]:

    #this is ugly and not very generic, but just as an example
    if not ind[0] in dic:
        dic[ind[0]] = {}
    if not ind[1] in dic[ind[0]]:
        dic[ind[0]][ind[1]] = {}
    if not ind[2] in dic[ind[0]][ind[1]]:
        dic[ind[0]][ind[1]][ind[2]] = {}
    if not ind[3] in dic[ind[0]][ind[1]][ind[2]]:
        dic[ind[0]][ind[1]][ind[2]][ind[3]] = {}

    dic[ind[0]][ind[1]][ind[2]][ind[3]]['grade'] = row['grade'].values[0]
    dic[ind[0]][ind[1]][ind[2]][ind[3]]['pass'] = row['pass'].values[0]

But in this case, 'age' and 'gender' are not nested under 'name'. I can't seem to wrap my head around how to do this...
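As a minimal sketch of one way to collect the person-level fields (not part of the original post; the shortened frame and the names `people` and `person_info` are illustrative assumptions), the columns that describe a person can be de-duplicated and turned into one dictionary per name:

```python
import pandas as pd

# Shortened, illustrative version of the frame from the question.
dat = pd.DataFrame({
    'name': ['John', 'John', 'Henry'],
    'age': [24, 24, 31],
    'gender': ['Male', 'Male', 'Male'],
    'study': ['Mathematics', 'Philosophy', 'Physics'],
})

# Keep only the columns that describe the person, one row per person.
people = dat[['name', 'age', 'gender']].drop_duplicates()

# Map each name to a dict of that person's attributes.
person_info = people.set_index('name').to_dict(orient='index')
print(person_info)
```

Each `person_info[name]` could then be merged into `dic[name]` from the loop above, for example with `dic[name].update(person_info[name])`, although that places the studies at the same level as 'age' and 'gender' rather than under a separate 'study' key.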

Another option is to set a MultiIndex and make a .to_dict('index') call. But then again, I don't see how I can nest both dicts and non-dicts under a single key...
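To illustrate the limitation, here is a small sketch (the shortened frame and the name `flat` are illustrative assumptions) of what the MultiIndex route produces. The keys come out as tuples, one per row, rather than as nested dictionaries, and 'age' is repeated in every row's dictionary:

```python
import pandas as pd

# Shortened, illustrative subset of the columns from the question.
df = pd.DataFrame({
    'name': ['John', 'John', 'Henry'],
    'age': [24, 24, 31],
    'course': ['Calculus 101', 'Calculus 102', 'Quantum mechanics'],
    'grade': ['A', 'B', 'C'],
})

# Index on the columns that are meant to become nesting levels.
flat = df.set_index(['name', 'course']).to_dict('index')

# Each key is a (name, course) tuple; the values are flat per-row dicts.
for key, value in flat.items():
    print(key, value)
```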

My question is similar to this one: Convert pandas DataFrame to a nested dict, but I'm looking for a more complex nesting (e.g., not just one last column which should be nested under all other columns). Most other questions on Stack Overflow ask for the reverse: creating a (possibly MultiIndex) DataFrame from a deeply nested dictionary.

Edit: The question is also similar to this question: Pandas convert Dataframe to Nested Json, but in that question, only the last column (e.g., column n) should be nested under all other columns (n-1, n-2, etc.; fully recursive nesting). In my question, columns n and n-1 should be nested under n-2, but columns n-2 and n-3 should be nested under n-4 (thus, importantly, n-2 is not nested under n-3 but under n-4). The MultiIndex partial solution offered by Mohammad Yusuf Ghazi depicts the structure nicely.

Solution

Not really concise, but it's the best I can get now:

>>> import pprint
>>> def rollup1(x):
...     return x.set_index('test')[['grade', 'pass']].to_dict(orient='index')
>>> def rollup2(x):
...     return x.groupby('course').apply(rollup1).to_dict()
>>> def rollup3(x):
...     return x.groupby('study').apply(rollup2).to_dict()

>>> df = dat.groupby(['name','age','gender']).apply(rollup3)
>>> df.name = 'study'
>>> res = df.reset_index(level=[1,2]).to_dict(orient='index')
>>> pprint.pprint(res)
{'Henry': {'age': 31,
           'gender': 'Male',
           'study': {'Physics': {'Quantum mechanics': {'Exam1': {'grade': 'C',
                                                                 'pass': True},
                                                       'Exam2': {'grade': 'C',
                                                                 'pass': True}}}}},
 'John': {'age': 24,
          'gender': 'Male',
          'study': {'Mathematics': {'Calculus 101': {'Essay': {'grade': 'A',
                                                               'pass': True},
                                                     'Exam': {'grade': 'A',
                                                              'pass': True}},
                                    'Calculus 102': {'Exam': {'grade': 'B',
                                                              'pass': True}}},
                    'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
                                                                     'pass': True}}}}}}

The idea is to roll the data up into dictionaries level by level while grouping, so that the final grouping by person yields a 'study' column that holds the nested dictionaries.
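The innermost roll-up step can be looked at in isolation. A minimal sketch, assuming a hypothetical frame `one_course` that holds the rows of a single course (the frame and the name `inner` are illustrative, not from the original answer):

```python
import pandas as pd

# Hypothetical rows belonging to one course.
one_course = pd.DataFrame({
    'test': ['Essay', 'Exam'],
    'grade': ['A', 'B'],
    'pass': [True, True],
})

# Same expression as in rollup1: index by test, keep the leaf columns,
# and convert each row into a dict keyed by the test name.
inner = one_course.set_index('test')[['grade', 'pass']].to_dict(orient='index')
print(inner)
```

The outer roll-up functions repeat this pattern one level up: each group is reduced to a dictionary, and the resulting Series of dictionaries is itself converted with `to_dict()`.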

Update: I've tried to create a more generic solution, so that it also works for questions like this one:

def rollup_to_dict_core(x, values, columns, d_columns=None):
    if d_columns is None:
        d_columns = []

    if len(columns) == 1:
        if len(values) == 1:
            return x.set_index(columns)[values[0]].to_dict()
        else:
            return x.set_index(columns)[values].to_dict(orient='index')
    else:
        res = x.groupby([columns[0]] + d_columns).apply(lambda y: rollup_to_dict_core(y, values, columns[1:]))
        if len(d_columns) == 0:
            return res.to_dict()
        else:
            res.name = columns[1]
            res = res.reset_index(level=list(range(1, len(d_columns) + 1)))
            return res.to_dict(orient='index')

def rollup_to_dict(x, values, d_columns=None):
    if d_columns is None:
        d_columns = []

    columns = [c for c in x.columns if c not in values and c not in d_columns]
    return rollup_to_dict_core(x, values, columns, d_columns)

>>> from pprint import pprint
>>> pprint(rollup_to_dict(dat, ['pass', 'grade'], ['age','gender']))
{'Henry': {'age': 31,
           'gender': 'Male',
           'study': {'Physics': {'Quantum mechanics': {'Exam1': {'grade': 'C',
                                                                 'pass': True},
                                                       'Exam2': {'grade': 'C',
                                                                 'pass': True}}}}},
 'John': {'age': 24,
          'gender': 'Male',
          'study': {'Mathematics': {'Calculus 101': {'Essay': {'grade': 'A',
                                                               'pass': True},
                                                     'Exam': {'grade': 'A',
                                                              'pass': True}},
                                    'Calculus 102': {'Exam': {'grade': 'B',
                                                              'pass': True}}},
                    'Philosophy': {'Aristotelean Ethics': {'Essay': {'grade': 'A',
                                                                     'pass': True}}}}}}
