Create a nested dict from a pandas DataFrame


Question

I have a pandas DataFrame that I would like to pull information from to create a nested dictionary for downstream use. However, I'm not very good at working with pandas yet and could use some help!

My DataFrame looks something like this:

     Sequence  A_start  A_stop  B_start  B_stop
0  sequence_1        1      25       26     100
1  sequence_2        1      31       32     201
2  sequence_3        1      27       28     231
3  sequence_4        1      39       40     191

I want to write this to a dictionary so that it has this form:

d = {'Sequence': {('A_start', 'A_stop'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('B_start', 'B_stop'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]}}

Once populated, it would look like this:

{'sequence_1': {('1', '25'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('26', '100'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]},
'sequence_2': {('1', '31'): [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}], ('32', '201'): [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]}, ...}

I thought a list comprehension might be an easy way to deal with this, but it might end up looking overly complicated. Below is what I have so far; it clearly doesn't work yet. I'm not sure whether I should use iteritems() or something other than groupby() to build the structure of the dict entries. Any help would be appreciated!

LTR_sub_features = [{'repeat_region':{'rpt_type':'long_terminal_repeat', 'note':"5'LTR"}}]
gag_sub_features = [{'misc_feature':{'gene': 'Gag', 'note': 'deletion of start codon'}}]

ltr_gag_dict = {
Sequence: {(A_start,A_end): LTR_sub_features, (B_start,B_end):gag_sub_features} 
for Sequence, A_start, A_end, B_start, B_end in ltr_gag_df.groupby('Sequence')}

Recommended answer

You can use iterrows() to build the dictionary as you go. iterrows() yields a tuple for each row: the first element (i.e. row[0]) is the row's index, and the second element (row[1]) is a pd.Series object holding all the values in that row.
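To make the (index, Series) pair concrete, here is a minimal sketch. The two-row frame and its values are made up purely for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical two-row frame indexed by sequence name (illustration only)
df = pd.DataFrame(
    {"A_start": [1, 1], "A_end": [25, 31]},
    index=["sequence_1", "sequence_2"],
)

# Each item produced by iterrows() is an (index, Series) tuple
idx, values = next(df.iterrows())

print(idx)              # index label of the first row
print(type(values))     # the row's values arrive as a pandas Series
print(values["A_end"])  # individual values are looked up by column name
```

Unpacking into two names, as above, is equivalent to indexing the tuple with row[0] and row[1].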

<input>
            A_start A_end   B_start     B_end
sequence_1  0.1     0.025   0.030303    0.001
sequence_2  0.2     0.050   0.060606    0.002
sequence_3  0.3     0.075   0.090909    0.003
sequence_4  0.4     0.100   0.121212    0.004

A_value = 'some value'
B_value = 'other value'
d = dict()


for row in df.iterrows():  
    d[row[0]] = {(row[1]['A_start'], row[1]['A_end']): A_value, (row[1]['B_start'], row[1]['B_end']): B_value}

<output>
{'sequence_1': {(0.10000000000000001, 0.025000000000000001): 'some value', (0.030303030303030304, 0.001): 'other value'}}
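
The same idea can be applied to the frame from the question. The following is only a sketch under a few assumptions: 'Sequence' is an ordinary column (not the index), the columns are named A_stop and B_stop as in the printed table, and the tuple keys are kept as integers rather than the strings shown in the desired output.

```python
import pandas as pd

# Data copied from the question; 'Sequence' is assumed to be a regular column
ltr_gag_df = pd.DataFrame({
    "Sequence": ["sequence_1", "sequence_2", "sequence_3", "sequence_4"],
    "A_start": [1, 1, 1, 1],
    "A_stop": [25, 31, 27, 39],
    "B_start": [26, 32, 28, 40],
    "B_stop": [100, 201, 231, 191],
})

LTR_sub_features = [{'repeat_region': {'rpt_type': 'long_terminal_repeat', 'note': "5'LTR"}}]
gag_sub_features = [{'misc_feature': {'gene': 'Gag', 'note': 'deletion of start codon'}}]

ltr_gag_dict = {}
for _, row in ltr_gag_df.iterrows():
    # One outer key per sequence; the inner keys are (start, stop) tuples
    ltr_gag_dict[row["Sequence"]] = {
        (int(row["A_start"]), int(row["A_stop"])): LTR_sub_features,
        (int(row["B_start"]), int(row["B_stop"])): gag_sub_features,
    }

print(ltr_gag_dict["sequence_1"])
```

If string keys are needed, wrap the values in str() instead of int(). Note that every sequence here points at the same two list objects; copy them (for example with copy.deepcopy) if they will be modified per sequence downstream.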
