pandas: merge DataFrames many-to-many without duplicates


Problem description

Suppose I have two DataFrames like the ones below:

import pandas as pd

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "c": [6,7,8,9]
}
df1 = pd.DataFrame(data_dic)

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "d": [10,10,12,13]
}
df2 = pd.DataFrame(data_dic)

Result:

df1

   a  b  c
0  0  3  6
1  0  3  7
2  1  4  8
3  2  5  9

df2

   a  b   d
0  0  3  10
1  0  3  10
2  1  4  12
3  2  5  13

Both DataFrames have the same key values in columns 'a' and 'b'.

Q: How can I merge them without duplicates? [Merging the DataFrames and then deleting the duplicates is not a satisfactory option.]

I tried the code below, but it creates duplicate rows and significantly increases the file size.

# Attempt 1: left join on the shared key columns
df = pd.merge(df1, df2, on=['a', 'b'], how='left', validate='many_to_many')
# Attempt 2: inner join on the same keys
df = pd.merge(df1, df2, on=['a', 'b'], how='inner')

Expected result:

   a  b  c   d
0  0  3  6  10
1  0  3  7  10
2  1  4  8  12
3  2  5  9  13

Thank you all for your hard work.

Recommended answer

Use GroupBy.cumcount to add a counter column to both DataFrames, then merge on that added column together with the keys:

df1['g'] = df1.groupby(['a','b']).cumcount()
df2['g'] = df2.groupby(['a','b']).cumcount()

df = pd.merge(df1, df2, on=['a', 'b', 'g'] , how='inner')
print (df)
   a  b  c  g   d
0  0  3  6  0  10
1  0  3  7  1  10
2  1  4  8  0  12
3  2  5  9  0  13
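
To make the effect of the counter column concrete, here is a minimal sketch that compares the row count of a plain merge with the counter-based merge. It uses the question's data; `assign` is used so that df1 and df2 are not modified, and the helper column g is dropped at the end so the output has the same columns as the expected result. The variable names `plain`, `left`, `right` and `result` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 10, 12, 13]})
keys = ["a", "b"]

# Plain merge: the key (0, 3) occurs twice on each side,
# so it alone contributes 2 * 2 = 4 rows.
plain = pd.merge(df1, df2, on=keys, how="inner")
print(len(plain))  # 6

# Counter-based merge: g numbers the occurrences of each key (0, 1, ...),
# so the n-th occurrence on the left only matches the n-th on the right.
left = df1.assign(g=df1.groupby(keys).cumcount())
right = df2.assign(g=df2.groupby(keys).cumcount())
result = pd.merge(left, right, on=keys + ["g"], how="inner").drop(columns="g")
print(result)
```

The plain merge returns 6 rows, while the counter-based merge returns the 4 rows with columns a, b, c, d shown in the expected result.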

The difference from other solutions is easiest to see when the second value of d in the second DataFrame is changed from 10 to 11. The first duplicate pair a, b from df1 is then correctly merged with the first a, b pair from df2, the second with the second, and the same holds for all other duplicates as well as for unique pairs:

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "d": [10,11,12,13]
}
df2 = pd.DataFrame(data_dic)


df1['g'] = df1.groupby(['a','b']).cumcount()
df2['g'] = df2.groupby(['a','b']).cumcount()

df = pd.merge(df1, df2, on=['a', 'b', 'g'] , how='inner')
print (df)

   a  b  c  g   d
0  0  3  6  0  10
1  0  3  7  1  11
2  1  4  8  0  12
3  2  5  9  0  13
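
One case the examples above do not cover is a key that is repeated a different number of times in each DataFrame. The sketch below wraps the same cumcount technique in a helper to show what happens then. The function name `merge_by_occurrence`, the temporary column name `_occurrence` and the DataFrame `df3` are hypothetical and not part of the original answer.

```python
import pandas as pd

def merge_by_occurrence(left, right, keys, how="inner"):
    """Pair the n-th occurrence of each key on the left with the n-th on the right."""
    counter = "_occurrence"  # hypothetical temporary name; must not clash with real columns
    left = left.assign(**{counter: left.groupby(keys).cumcount()})
    right = right.assign(**{counter: right.groupby(keys).cumcount()})
    merged = pd.merge(left, right, on=keys + [counter], how=how)
    return merged.drop(columns=counter)

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
# Assumed variant of df2 in which the key (0, 3) occurs only once.
df3 = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5], "d": [10, 12, 13]})

inner = merge_by_occurrence(df1, df3, ["a", "b"], how="inner")
left_join = merge_by_occurrence(df1, df3, ["a", "b"], how="left")
print(inner)      # the second (0, 3) row of df1 has no partner and is dropped
print(left_join)  # the second (0, 3) row of df1 is kept, with d missing
```

With how='inner' the unmatched extra occurrence is dropped, while how='left' keeps every row of df1 and leaves d as NaN where no partner exists, so the choice of how depends on whether unmatched rows should survive.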
