Merging two relational pandas dataframes as a single nested JSON output


Problem description

I have two relational dataframes like the ones below.

df_doc:

|document_id| name|
+-----------+-----+
|          1|  aaa|
|          2|  bbb|

df_topic:

|   topic_id| name|document_id|
+-----------+-----+-----------+
|          1|  xxx|          1|
|          2|  yyy|          2|
|          3|  zzz|          2|

I want to merge them into a single nested JSON file like the one below.

[
    {
        "document_id": 1,
        "name": "aaa",
        "topics": [
            {
                "topic_id": 1,
                "name": "xxx"
            }
        ]
    },
    {
        "document_id": 2,
        "name": "bbb",
        "topics": [
            {
                "topic_id": 2,
                "name": "yyy"
            },
            {
                "topic_id": 3,
                "name": "zzz"
            }
        ]
    }
]

That is, I want to do the reverse of what pandas.io.json.json_normalize does.

An answer using sqlite is also OK.

NOTE: Both df_doc and df_topic have a column called "name"; the column names are the same but the values differ.

Thanks.

Solution

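If df_doc has only 2 columns, use map to add the new title column to df_topic first, then groupby, convert each group with to_dict, and finally call to_json (this snippet assumes the second column of df_doc is named title):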

s = df_doc.set_index('document_id')['title']
df_topic['title'] = df_topic['document_id'].map(s)

# select all columns except the ones in the list
cols = df_topic.columns.difference(['document_id','title'])
j = (df_topic.groupby(['document_id','title'])[cols]
             .apply(lambda x: x.to_dict('records'))
             .reset_index(name='topics')
             .to_json(orient='records'))
print(j)

[{"document_id":1,"title":"aaa","topics":[{"name":"xxx","topic_id":1}]},
 {"document_id":2,"title":"bbb","topics":[{"name":"yyy","topic_id":2},
                                          {"name":"zzz","topic_id":3}]}]
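
For anyone who wants to run the map approach end to end, here is a minimal self-contained sketch. The sample frames are rebuilt from the question, with the second column of df_doc named title as the snippet above assumes; the variable names titles, nested and result are only illustrative. The orient is spelled out as 'records', which should work across pandas versions.

```python
import json
import pandas as pd

# Sample data from the question; the column name `title` is an assumption
# that matches the snippet above (the question itself calls it `name`).
df_doc = pd.DataFrame({"document_id": [1, 2], "title": ["aaa", "bbb"]})
df_topic = pd.DataFrame({
    "topic_id": [1, 2, 3],
    "name": ["xxx", "yyy", "zzz"],
    "document_id": [1, 2, 2],
})

# Look up each topic's document title by document_id.
titles = df_doc.set_index("document_id")["title"]
df_topic["title"] = df_topic["document_id"].map(titles)

# Columns that belong inside each nested topic record.
cols = list(df_topic.columns.difference(["document_id", "title"]))

# One row per document, with a list of topic dicts in the `topics` column.
nested = (df_topic.groupby(["document_id", "title"])[cols]
                  .apply(lambda g: g.to_dict("records"))
                  .reset_index(name="topics"))

# Round-trip through json to get plain Python objects for inspection.
result = json.loads(nested.to_json(orient="records"))
print(json.dumps(result, indent=4))
```

Note that columns.difference returns the remaining column names sorted, so name comes before topic_id inside each topic record.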

If df_doc has multiple columns, use merge instead of map:

df = df_topic.merge(df_doc, on='document_id')
print (df)
   topic_id name  document_id title
0         1  xxx            1   aaa
1         2  yyy            2   bbb
2         3  zzz            2   bbb

cols = df.columns.difference(['document_id','title'])
j = (df.groupby(['document_id','title'])[cols]
       .apply(lambda x: x.to_dict('records'))
       .reset_index(name='topics')
       .to_json(orient='records'))

EDIT: If both frames can have the same column names, add the suffixes parameter so that _ is appended to the duplicated column names coming from df_doc, which keeps them unique, and strip the underscore at the end:

df = df_topic.merge(df_doc, on='document_id', suffixes=('','_'))
print (df)
   topic_id name  document_id name_
0         1  xxx            1   aaa
1         2  yyy            2   bbb
2         3  zzz            2   bbb

# exclude the grouping columns, including the suffixed name_ column
cols = df.columns.difference(['document_id','name_'])
j = (df.groupby(['document_id','name_'])[cols]
       .apply(lambda x: x.to_dict('records'))
       .reset_index(name='topics')
       .rename(columns=lambda x: x.rstrip('_'))
       .to_json(orient='records'))
print(j)
[{"document_id":1,"name":"aaa","topics":[{"name":"xxx","topic_id":1}]},
 {"document_id":2,"name":"bbb","topics":[{"name":"yyy","topic_id":2},
                                         {"name":"zzz","topic_id":3}]}]
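
As an alternative sketch (not part of the answer above), the nesting can also be done in plain Python once the rows are pulled out of the frames. This uses the question's original data, where both frames have a name column, so no renaming or suffixes are needed. It keeps topic_id before name as in the desired output, and a document without any topics would still appear with an empty list, whereas a merge with the default how='inner' keeps only documents that have at least one topic. The names topics_by_doc and docs are illustrative only.

```python
import json
from collections import defaultdict

import pandas as pd

# Sample data exactly as described in the question.
df_doc = pd.DataFrame({"document_id": [1, 2], "name": ["aaa", "bbb"]})
df_topic = pd.DataFrame({
    "topic_id": [1, 2, 3],
    "name": ["xxx", "yyy", "zzz"],
    "document_id": [1, 2, 2],
})

# Collect the topics of each document; int() turns numpy integers
# into plain Python ints so that json.dumps can serialize them.
topics_by_doc = defaultdict(list)
for topic_id, name, doc_id in zip(
    df_topic["topic_id"], df_topic["name"], df_topic["document_id"]
):
    topics_by_doc[int(doc_id)].append({"topic_id": int(topic_id), "name": name})

# Build one record per document; documents without topics get an empty list.
docs = [
    {
        "document_id": int(doc_id),
        "name": name,
        "topics": topics_by_doc.get(int(doc_id), []),
    }
    for doc_id, name in zip(df_doc["document_id"], df_doc["name"])
]

j = json.dumps(docs, indent=4)
print(j)
```

For large frames the groupby version is likely the better fit; this loop is mainly easier to adapt when the nested records need a custom shape.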
