Pandas DataFrame merge ends up with more rows

Question

I am doing:

a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])

After this, a_df actually has more rows than before the operation. How is this possible?

Both DataFrames have millions of rows, so it is hard for me to narrow down the problem. I am probably missing something about how a left merge works.

Recommended answer

The problem is duplicates in the join keys. A left join keeps every row of the left DataFrame, but each of those rows is repeated once for every matching row in the right DataFrame, so merge returns all combinations of rows that share the same (col1, col2) pair. See the sample below:

import pandas as pd

a_df = pd.DataFrame({'A':list('abcdef'),
                     'B':[4,5,4,5,5,4],
                     'C':[7,8,9,4,2,3],
                     'D':[1,3,5,7,1,0],
                     'col1':[5,5,5,9,9,9],
                     'col2':list('aaabbb')})

print (a_df)
   A  B  C  D  col1 col2
0  a  4  7  1     5    a
1  b  5  8  3     5    a
2  c  4  9  5     5    a
3  d  5  4  7     9    b
4  e  5  2  1     9    b
5  f  4  3  0     9    b

b_df = pd.DataFrame({'E':[7,8,0,1],
                     'F':list('efgh'),
                     'col1':[5,5,9,9],
                     'col2':list('aabb')})

print (b_df)
   E  F  col1 col2
0  7  e     5    a
1  8  f     5    a
2  0  g     9    b
3  1  h     9    b


a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
print (a_df)
    A  B  C  D  col1 col2  E  F
0   a  4  7  1     5    a  7  e
1   a  4  7  1     5    a  8  f
2   b  5  8  3     5    a  7  e
3   b  5  8  3     5    a  8  f
4   c  4  9  5     5    a  7  e
5   c  4  9  5     5    a  8  f
6   d  5  4  7     9    b  0  g
7   d  5  4  7     9    b  1  h
8   e  5  2  1     9    b  0  g
9   e  5  2  1     9    b  1  h
10  f  4  3  0     9    b  0  g
11  f  4  3  0     9    b  1  h
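
When the frames are too large to inspect by eye, one way to catch this situation early is the validate argument of merge, which raises an error instead of silently multiplying rows. The following is a minimal sketch, not part of the original answer; the small left and right frames are made up for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

# Illustrative frames (hypothetical data): the pair (5, 'a') occurs twice in `right`.
left = pd.DataFrame({'col1': [5, 5, 9],
                     'col2': ['a', 'a', 'b'],
                     'A': ['x', 'y', 'z']})
right = pd.DataFrame({'col1': [5, 5, 9],
                      'col2': ['a', 'a', 'b'],
                      'E': [7, 8, 0]})

# Without validation the merge succeeds and quietly returns extra rows.
merged = left.merge(right, how='left', on=['col1', 'col2'])
print(len(left), len(merged))

try:
    # 'many_to_one' requires the key pairs to be unique in the right frame.
    left.merge(right, how='left', on=['col1', 'col2'], validate='many_to_one')
    raised = False
except MergeError:
    raised = True

print(raised)
```

If the merge is expected to be a pure lookup that never adds rows, passing validate='many_to_one' turns a silent row explosion into an immediate error.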


Solution 1 is to remove the duplicates in the second DataFrame (starting from the original a_df and b_df, not from the merged result above):

b_df = b_df.drop_duplicates(['col1', 'col2'])
print (b_df)
   E  F  col1 col2
0  7  e     5    a
2  0  g     9    b

a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
print (a_df)
   A  B  C  D  col1 col2  E  F
0  a  4  7  1     5    a  7  e
1  b  5  8  3     5    a  7  e
2  c  4  9  5     5    a  7  e
3  d  5  4  7     9    b  0  g
4  e  5  2  1     9    b  0  g
5  f  4  3  0     9    b  0  g
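
Note that drop_duplicates decides which of the duplicated rows survives, and therefore which E and F values end up in the merged result. A small sketch of the keep parameter, reusing the b_df from the sample above:

```python
import pandas as pd

b_df = pd.DataFrame({'E': [7, 8, 0, 1],
                     'F': list('efgh'),
                     'col1': [5, 5, 9, 9],
                     'col2': list('aabb')})

# Default keep='first': the first row of each (col1, col2) pair is retained.
first = b_df.drop_duplicates(['col1', 'col2'])

# keep='last': the last row of each pair is retained instead.
last = b_df.drop_duplicates(['col1', 'col2'], keep='last')

print(first['F'].tolist())
print(last['F'].tolist())
```

Which row is the right one to keep depends on the data; if neither is obviously correct, aggregating the duplicates may be the safer choice.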

Solution 2 is to make the (col1, col2) pairs unique by aggregation (starting from the original a_df and b_df):

b_df = b_df.groupby(['col1', 'col2'], as_index=False).agg({'E':'mean', 'F': ','.join})
print (b_df)
   col1 col2    E    F
0     5    a  7.5  e,f
1     9    b  0.5  g,h

a_df = a_df.merge(b_df, how='left', on=['col1', 'col2'])
print (a_df)
   A  B  C  D  col1 col2    E    F
0  a  4  7  1     5    a  7.5  e,f
1  b  5  8  3     5    a  7.5  e,f
2  c  4  9  5     5    a  7.5  e,f
3  d  5  4  7     9    b  0.5  g,h
4  e  5  2  1     9    b  0.5  g,h
5  f  4  3  0     9    b  0.5  g,h

It is also possible to list all duplicated key pairs in the original b_df with duplicated and boolean indexing (keep=False marks every occurrence, not only the repeats):

print (b_df[b_df.duplicated(['col1', 'col2'], keep=False)])

   E  F  col1 col2
0  7  e     5    a
1  8  f     5    a
2  0  g     9    b
3  1  h     9    b
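
With millions of rows, printing every duplicated row may be too much. A rough sketch of an alternative check, not from the original answer: count rows per key pair in the right frame and use those counts to predict how many rows the left merge will produce. The names keys, b_counts, dupe_keys and expected_rows are illustrative, and the sketch assumes the key columns contain no missing values.

```python
import pandas as pd

a_df = pd.DataFrame({'col1': [5, 5, 5, 9, 9, 9],
                     'col2': list('aaabbb')})
b_df = pd.DataFrame({'col1': [5, 5, 9, 9],
                     'col2': list('aabb'),
                     'E': [7, 8, 0, 1]})

keys = ['col1', 'col2']

# Number of rows per key pair in the right-hand frame.
b_counts = b_df.groupby(keys).size().rename('n_right').reset_index()

# Key pairs that occur more than once are the ones that multiply rows.
dupe_keys = b_counts[b_counts['n_right'] > 1]
print(dupe_keys)

# In a left merge each left row appears once per matching right row,
# or exactly once if it has no match (hence fillna(1)).
per_row = a_df[keys].merge(b_counts, how='left', on=keys)['n_right'].fillna(1)
expected_rows = int(per_row.sum())

actual_rows = len(a_df.merge(b_df, how='left', on=keys))
print(expected_rows, actual_rows)
```

If expected_rows is larger than len(a_df), the difference is explained entirely by the key pairs listed in dupe_keys.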
