Python Pandas: How to merge based on an "OR" condition?


Question

Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables based on both ShipNumber and TrackNumber. However, if I simply use merge in the following way (pseudo code, not real code):

tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])

then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
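
To make the "both must match" behavior concrete, here is a minimal sketch. The sample values are invented for illustration and are not from the original question; only the column names come from it.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
tab1 = pd.DataFrame({
    'ShipNumber': ['S1', 'S2', 'S3'],
    'TrackNumber': ['T1', 'T2', 'T3'],
    'Quantity': [10, 20, 30],
})
tab2 = pd.DataFrame({
    'ShipNumber': ['S1', 'S2', 'S9'],
    'TrackNumber': ['T1', 'T8', 'T3'],
    'AmountReceived': [100.0, 200.0, 300.0],
})

# Passing a list to `on` requires BOTH keys to be equal (AND semantics).
both = tab1.merge(tab2, how='left', on=['ShipNumber', 'TrackNumber'])
print(both)
# Only (S1, T1) finds a partner; the S2 and S3 rows get NaN in
# AmountReceived even though one of their two keys matches a row in tab2.
```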

However, in my case, sometimes the ShipNumber column values will match and sometimes the TrackNumber column values will match; as long as one of the two values matches for a row, I want the merge to happen.

In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.

So basically this is an either/or match condition (pseudo code):

if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge

I hope my question makes sense... Any help is really really appreciated!

As suggested, I looked into this post: "Python pandas merge with OR logic". I don't think it is quite the same issue, though: the OP of that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; instead, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).

Solution

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

import pandas as pd

df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})

df1         df2
   A  B        A  B
0  3  7     0  1  4
1  2  8     1  5  1
2  1  9     2  6  8
3  4  5     3  4  5

With these data frames we should see:

  • df1.loc[2] matches A on df2.loc[0]
  • df1.loc[1] matches B on df2.loc[2]
  • df1.loc[3] matches both A and B on df2.loc[3]
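
One way to double-check which rows should pair up is to compare every row of df1 with every row of df2 and keep the pairs where either column agrees. This is only a brute-force sketch for small frames, not part of the original answer; the helper column `_key` and the `_1`/`_2` suffixes are names made up here.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 4], 'B': [7, 8, 9, 5]})
df2 = pd.DataFrame({'A': [1, 5, 6, 4], 'B': [4, 1, 8, 5]})

# Pair every row of df1 with every row of df2 through a constant helper key,
# keeping the original row labels as the columns index_1 / index_2.
left = df1.reset_index().assign(_key=1)
right = df2.reset_index().assign(_key=1)
pairs = left.merge(right, on='_key', suffixes=('_1', '_2')).drop(columns='_key')

# Keep the pairs where A matches OR B matches.
either = pairs[(pairs['A_1'] == pairs['A_2']) | (pairs['B_1'] == pairs['B_2'])]
print(either[['index_1', 'index_2', 'A_1', 'A_2', 'B_1', 'B_2']])
# Matching (df1 row, df2 row) pairs: (1, 2), (2, 0), (3, 3)
```

The all-pairs table has len(df1) * len(df2) rows, so this check is only practical for small inputs.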

We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

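# Keep the result in df so the duplicate check below can use it.
# sort=True orders the columns alphabetically, as in the output shown.
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)], sort=True)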

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.

dupes = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dupes]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
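
Applied to the tables from the question, the same merge-and-concat idea might look like the sketch below. The sample values and the helper columns `row1`/`row2` are assumptions made for illustration. Instead of comparing suffixed columns, this variant tags each row with its original position and drops repeated (row1, row2) pairs, which is a swapped-in way of doing the de-duplication step.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
tab1 = pd.DataFrame({
    'ShipNumber': ['S1', 'S2', 'S3'],
    'TrackNumber': ['T1', 'T2', 'T3'],
    'Quantity': [10, 20, 30],
})
tab2 = pd.DataFrame({
    'ShipNumber': ['S1', 'S2', 'S9'],
    'TrackNumber': ['T1', 'T8', 'T3'],
    'AmountReceived': [100.0, 200.0, 300.0],
})

# Tag each row with its original position so a pair of rows can be
# recognized no matter which key produced the match.
left = tab1.reset_index().rename(columns={'index': 'row1'})
right = tab2.reset_index().rename(columns={'index': 'row2'})

by_ship = left.merge(right, on='ShipNumber', suffixes=('_1', '_2'))
by_track = left.merge(right, on='TrackNumber', suffixes=('_1', '_2'))

# Stack both results and keep each (row1, row2) pair only once.
matched = (pd.concat([by_ship, by_track], ignore_index=True, sort=False)
             .drop_duplicates(subset=['row1', 'row2']))
print(matched[['row1', 'row2', 'Quantity', 'AmountReceived']])
```

Both merges here are inner joins, so rows of tab1 with no partner on either key are dropped; keeping them, as the how="left" call in the question would, needs an extra step that this sketch does not cover.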
