Match rows in one Pandas dataframe to another based on three columns

Question

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).

dfA looks something like this:

      X     Y    ONSET_TIME    COLOUR 
0   104    78          1083         6    
1   172    78          1083        16
2   240    78          1083        15 
3   308    78          1083         8
4   376    78          1083         8
5   444    78          1083        14
6   512    78          1083        14
... ...   ...           ...       ...

dfB looks like this:

    TIME     X     Y
0      7   512   350 
1   1722   512   214 
2   1906   376   214 
3   2095   376   146 
4   2234   308    78 
5   2406   172   146
...  ...   ...   ...  

For every row in dfB, I want to find the row in dfA whose X and Y values equal those of the dfB row and whose ONSET_TIME is the latest one that dfB['TIME'] is still greater than, and then return the value of dfA['COLOUR'] for that row.

dfA represents refreshes of a display, where X and Y are the coordinates of items on the display, so they repeat for every different ONSET_TIME (there are 108 pairs of coordinates for each value of ONSET_TIME).

There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.

I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time.

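colours = []
for r in range(len(dfB)):
    for s in range(len(dfA)):
        # ONSET_TIME of the next refresh (108 rows later); no upper bound for the last refresh
        next_onset = dfA.iloc[s + 108, 2] if s + 108 < len(dfA) else float('inf')
        if (dfB.iloc[r, 1] == dfA.iloc[s, 0]) and (dfB.iloc[r, 2] == dfA.iloc[s, 1]) and (dfA.iloc[s, 2] <= dfB.iloc[r, 0] < next_onset):
            colours.append(dfA.iloc[s, 3])
            break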

Recommended answer

There is probably an even more efficient way to do this, but here is a method without those slow for loops:

import pandas as pd

dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})

# create one single table by joining on the coordinates
mergeDf = pd.merge(dfA, dfB, on=['X','Y'])
# keep only rows whose onset time is earlier than the time in dfB
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
# for each (X, Y), keep the row with the latest onset time (closest to the time in dfB);
# a plain groupby(...).max() would take the max of every column separately and mix up COLOR
groupedDf = filteredDf.loc[filteredDf.groupby(['X','Y'])['ONSET_TIME'].idxmax()]

print(filteredDf)

   X  Y  ONSET_TIME   COLOR  Time
0  1  1           5     Red    10
1  1  1           7    Blue    10
2  2  2           9    Blue    20
3  2  2          16     red    20
5  3  3          28  Orange    30


print(groupedDf)

   X  Y  ONSET_TIME   COLOR  Time
1  1  1           7    Blue    10
3  2  2          16     red    20
5  3  3          28  Orange    30

The basic idea is to merge the two tables so that the times sit together in one table, drop the rows whose ONSET_TIME is not earlier than the time in dfB, and then keep, for each X and Y pair, the record with the largest ONSET_TIME (the one closest to the time in your dfB). Let me know if you have questions about this.
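
One caveat with grouping on X and Y alone: it assumes each coordinate pair appears only once in dfB. If the same pair can occur at several times, a possible alternative is pd.merge_asof, which matches each row to the nearest earlier key within groups. The sketch below is only an illustration under assumptions: the data is made up, the column names follow the question (TIME, ONSET_TIME, COLOUR), and allow_exact_matches=False is chosen to mirror the strict "greater than" wording.

```python
import pandas as pd

# Small made-up frames that reuse the column names from the question.
dfA = pd.DataFrame({
    'X':          [1, 2, 1, 2, 1, 2],
    'Y':          [1, 2, 1, 2, 1, 2],
    'ONSET_TIME': [5, 5, 15, 15, 25, 25],
    'COLOUR':     [6, 16, 15, 8, 14, 3],
})
dfB = pd.DataFrame({
    'TIME': [10, 20, 30, 12],
    'X':    [1, 2, 1, 2],
    'Y':    [1, 2, 1, 2],
})

# merge_asof needs both frames sorted by their time keys.
# reset_index keeps the original dfB row labels in a column named 'index'.
left = dfB.reset_index().sort_values('TIME')
right = dfA.sort_values('ONSET_TIME')

# For each dfB row, pick the dfA row with the same (X, Y) and the
# latest ONSET_TIME that is strictly smaller than TIME.
matched = pd.merge_asof(
    left, right,
    left_on='TIME', right_on='ONSET_TIME',
    by=['X', 'Y'],
    direction='backward',
    allow_exact_matches=False,
)

# Put the rows back into the original dfB order.
result = matched.sort_values('index').set_index('index')
result.index.name = None
print(result[['TIME', 'X', 'Y', 'ONSET_TIME', 'COLOUR']])
```

Rows of dfB with no earlier refresh at the same coordinates would come back with NaN in ONSET_TIME and COLOUR, so they are easy to spot afterwards.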
