Pandas DataFrame merge, ends up with more rows
Problem description
I am doing
a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
After this, a_df actually has more rows than before the operation. How is this possible?
They both have millions of rows, so it's hard for me to narrow down the problem. I am probably missing something about how a left merge works.
Recommended answer
The problem is duplicate key values. For each (col1, col2) pair, a left join merge returns every combination of the matching rows from both DataFrames, so a row of a_df is repeated once for each matching row in b_df. See the sample below:
import pandas as pd

a_df = pd.DataFrame({'A':list('abcdef'),
                     'B':[4,5,4,5,5,4],
                     'C':[7,8,9,4,2,3],
                     'D':[1,3,5,7,1,0],
                     'col1':[5,5,5,9,9,9],
                     'col2':list('aaabbb')})
print (a_df)
A B C D col1 col2
0 a 4 7 1 5 a
1 b 5 8 3 5 a
2 c 4 9 5 5 a
3 d 5 4 7 9 b
4 e 5 2 1 9 b
5 f 4 3 0 9 b
b_df = pd.DataFrame({'E':[7,8,0,1],
                     'F':list('efgh'),
                     'col1':[5,5,9,9],
                     'col2':list('aabb')})
print (b_df)
E F col1 col2
0 7 e 5 a
1 8 f 5 a
2 0 g 9 b
3 1 h 9 b
a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
print (a_df)
A B C D col1 col2 E F
0 a 4 7 1 5 a 7 e
1 a 4 7 1 5 a 8 f
2 b 5 8 3 5 a 7 e
3 b 5 8 3 5 a 8 f
4 c 4 9 5 5 a 7 e
5 c 4 9 5 5 a 8 f
6 d 5 4 7 9 b 0 g
7 d 5 4 7 9 b 1 h
8 e 5 2 1 9 b 0 g
9 e 5 2 1 9 b 1 h
10 f 4 3 0 9 b 0 g
11 f 4 3 0 9 b 1 h
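The row count of the result can be predicted before merging, which may help when the frames are too large to inspect by eye. The following is a minimal sketch, not part of the original answer: it reuses the sample key columns, and the names keys, b_counts, per_row, n_matches and expected_rows are made up for illustration.

```python
import pandas as pd

# Trimmed-down copies of the sample frames (only the columns needed here).
a_df = pd.DataFrame({'A': list('abcdef'),
                     'col1': [5, 5, 5, 9, 9, 9],
                     'col2': list('aaabbb')})
b_df = pd.DataFrame({'E': [7, 8, 0, 1],
                     'col1': [5, 5, 9, 9],
                     'col2': list('aabb')})
keys = ['col1', 'col2']

# Number of rows in b_df for each (col1, col2) pair.
b_counts = b_df.groupby(keys).size().rename('n_matches').reset_index()

# b_counts has unique keys, so this merge keeps exactly one row per a_df row.
per_row = a_df[keys].merge(b_counts, how='left', on=keys)

# A left merge keeps unmatched a_df rows once, so a missing count means 1.
expected_rows = int(per_row['n_matches'].fillna(1).sum())

merged = a_df.merge(b_df, how='left', on=keys)
print(expected_rows, len(merged))
```

If expected_rows is larger than len(a_df), at least one key pair occurs more than once in b_df.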
Solution 1 is to remove the duplicates in the second DataFrame (this keeps only the first row for each key pair):
b_df = b_df.drop_duplicates(['col1', 'col2'])
print (b_df)
E F col1 col2
0 7 e 5 a
2 0 g 9 b
a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
print (a_df)
A B C D col1 col2 E F
0 a 4 7 1 5 a 7 e
1 b 5 8 3 5 a 7 e
2 c 4 9 5 5 a 7 e
3 d 5 4 7 9 b 0 g
4 e 5 2 1 9 b 0 g
5 f 4 3 0 9 b 0 g
Solution 2 is to make the (col1, col2) pairs unique by aggregation:
b_df = b_df.groupby(['col1', 'col2'], as_index=False).agg({'E':'mean', 'F': ','.join})
print (b_df)
col1 col2 E F
0 5 a 7.5 e,f
1 9 b 0.5 g,h
a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
print (a_df)
A B C D col1 col2 E F
0 a 4 7 1 5 a 7.5 e,f
1 b 5 8 3 5 a 7.5 e,f
2 c 4 9 5 5 a 7.5 e,f
3 d 5 4 7 9 b 0.5 g,h
4 e 5 2 1 9 b 0.5 g,h
5 f 4 3 0 9 b 0.5 g,h
It is also possible to list all duplicates in b_df with duplicated and boolean indexing:
print (b_df[b_df.duplicated(['col1', 'col2'], keep=False)])
E F col1 col2
0 7 e 5 a
1 8 f 5 a
2 0 g 9 b
3 1 h 9 b
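As a complementary check, merge accepts a validate argument that raises an error instead of silently multiplying rows. The sketch below is an addition to the original answer and assumes a pandas version that supports validate; the names keys, duplicates_found, b_unique and checked are hypothetical.

```python
import pandas as pd

# Trimmed-down copies of the sample frames (only the columns needed here).
a_df = pd.DataFrame({'A': list('abcdef'),
                     'col1': [5, 5, 5, 9, 9, 9],
                     'col2': list('aaabbb')})
b_df = pd.DataFrame({'E': [7, 8, 0, 1],
                     'col1': [5, 5, 9, 9],
                     'col2': list('aabb')})
keys = ['col1', 'col2']

# 'many_to_one' asserts that the key pairs are unique in the right frame.
try:
    a_df.merge(b_df, how='left', on=keys, validate='many_to_one')
    duplicates_found = False
except pd.errors.MergeError as err:
    duplicates_found = True
    print(err)

# After deduplicating b_df the same check passes and the row count is kept.
b_unique = b_df.drop_duplicates(keys)
checked = a_df.merge(b_unique, how='left', on=keys, validate='many_to_one')
print(len(checked) == len(a_df))
```

With millions of rows this fails fast on the key check, so the oversized result never has to be inspected.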