Pandas: Approximate join on one column, exact match on other columns
Question
I have two pandas dataframes that I want to join/merge exactly on a number of columns (say 3) and approximately, i.e. by nearest neighbour, on one (date) column. I also want to return the difference (in days) between the matched dates. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the "leftovers" are also interesting if they are not too hard to get hold of. Most of the "exact match" observations will exist multiple times in each dataframe.
I've been trying to use difflib.get_close_matches on the concatenation of all of the columns as strings (which is stupid, I know!), but it does not always give exact matches. I suppose I need to loop over the exact matches first and then find the nearest matches within each group, but I just can't seem to get it right...
The dataframes look like:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
df1
Out[430]:
col1 col2 col3 date
index
a1 1232 asd 1 2010-01-23
a2 432 dsa12 2 2016-05-20
a3 432 dsa12 2 2010-06-20
a4 123 asd2 3 2008-10-21
df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')
df2
Out[434]:
col1 col2 col3 date
index
b1 132 asd 1 2010-01-23
b2 432 dsa12 2 2016-05-23
b3 432 dsa12 2 2010-06-10
b4 123 sd2 3 2008-10-21
In the end, what I want is:
col1 col2 col3 date diff match_index
index
a1 1232 asd 1 2010-01-23 nan nan
a2 432 dsa12 2 2016-05-20 -3 b2
a3 432 dsa12 2 2010-06-20 10 b3
a4 123 asd2 3 2008-10-21 nan nan
a5 123 sd2 3 2008-10-21 nan b4
Or, if it's much easier with just an inner join, I'd like:
col1 col2 col3 date diff match_index
index
a2 432 dsa12 2 2016-05-20 -3 b2
a3 432 dsa12 2 2010-06-20 10 b3
Recommended answer
I am not sure if this suits. It achieves more or less what you want but does not actually perform a merge. It follows the same idea as this question, except that instead of subsetting df1 based on only one column, here we match on multiple columns using a groupby and do it on both dataframes. If you do want to explicitly include the merge command and are happy with an inner join, check the snippet near the end of the answer that uses merge.
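import pandas as pd
from sklearn.neighbors import NearestNeighbors

def find_nearest(group, df2, groupname):
    # group: rows of one frame that share the same key values; df2: the other frame
    try:
        # rows of the other frame with exactly the same key values
        match = df2.groupby(groupname).get_group(group.name).copy()
    except KeyError:
        # no exact match on the key columns: return the group unchanged
        return group
    match['date'] = pd.to_datetime(match['date'])
    # search on whole days since the epoch so the estimator sees plain integers
    match_days = match['date'].values.astype('datetime64[D]').astype('int64')[:, None]
    group_days = group['date'].values.astype('datetime64[D]').astype('int64')[:, None]
    nbrs = NearestNeighbors(n_neighbors=1).fit(match_days)
    dist, ind = nbrs.kneighbors(group_days)
    group = group.copy()
    group['date1'] = group['date']                      # keep the original date
    group['date'] = match['date'].values[ind.ravel()]   # date of the nearest match
    group['diff'] = group['date1'] - group['date']
    group['match_index'] = match.index[ind.ravel()]
    return group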
#change dates from string to datetime
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
#find closest dates and differences
keys = ['col1', 'col2', 'col3']
df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)
#fill unmatched dates
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])
df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys)
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])
#drop original column
df1_mod.drop('date', inplace=True, axis=1)
df1_mod.rename(columns = {'date1':'date'}, inplace=True)
df2_mod.drop('date', inplace=True, axis=1)
df2_mod.rename(columns = {'date1':'date'}, inplace=True)
df2_mod['diff'] = -df2_mod['diff']
#drop redundant values
df2_mod.drop(df2_mod[df2_mod.match_index.str.len()>0].index, inplace=True)
#merge the two
df_final = pd.merge(df1_mod, df2_mod, how='outer')
The result is as follows:
In [349]: df_final
Out[349]:
col1 col2 col3 date diff match_index
0 1232 asd 1 2010-01-23 NaT NaN
1 432 dsa12 2 2016-05-20 -3 days b2
2 432 dsa12 2 2010-06-20 10 days b3
3 123 asd2 3 2008-10-21 NaT NaN
4 132 asd 1 2010-01-23 NaT NaN
5 123 sd2 3 2008-10-21 NaT NaN
Using the merge command:
In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
col1 col2 col3 date diff match_index
0 432 dsa12 2 2016-05-20 -3 days b2
2 432 dsa12 2 2010-06-20 10 days b3
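For the inner-join case, the idea from the question (match exactly first, then pick the nearest date within each exact match) can also be sketched with plain pandas and no scikit-learn. This is only a sketch under assumptions, not the code used above: the names left, right, pairs, best and inner are made up for illustration, and the pairwise merge can grow large if a key combination repeats many times in both frames.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

# Same sample data as in the question, with dates already parsed.
left = pd.DataFrame({
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21']),
}, index=pd.Index(['a1', 'a2', 'a3', 'a4'], name='index'))

right = pd.DataFrame({
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21']),
}, index=pd.Index(['b1', 'b2', 'b3', 'b4'], name='index'))

# Exact inner join on the key columns: every left row is paired with every
# right row that shares its keys. Right-hand duplicates get the '_match' suffix.
pairs = pd.merge(left.reset_index(), right.reset_index(),
                 on=keys, suffixes=('', '_match'))

# Signed difference in days (left date minus matched date).
pairs['diff'] = (pairs['date'] - pairs['date_match']).dt.days

# For each left row, keep the pair with the smallest absolute difference.
best = pairs.loc[pairs['diff'].abs().groupby(pairs['index']).idxmin()]

inner = (best.rename(columns={'index_match': 'match_index'})
             .set_index('index')[keys + ['date', 'diff', 'match_index']])
print(inner)
```

Rows of left without an exact key match simply drop out, which corresponds to the inner-join output asked for in the question.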
The case considered in the comments, namely:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
produces the following result:
In [351]: df_final
Out[351]:
col1 col2 col3 date diff match_index
0 1232 asd 1 2010-01-23 NaT NaN
1 1432 dsa12 2 2016-05-20 NaT NaN
2 432 dsa12 2 2010-06-20 10 days b3
3 123 asd2 3 2008-10-21 NaT NaN
4 132 asd 1 2010-01-23 NaT NaN
5 123 sd2 3 2008-10-21 NaT NaN
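As an alternative to the groupby approach above, pandas also ships pd.merge_asof, which joins on the nearest value of one column (on) while requiring exact equality on others (by). The following is a minimal sketch of that swapped-in technique, assuming the sample frames from the question; the helper columns date_match and match_index are names chosen here for illustration. Note that merge_asof behaves like a left join, so unmatched rows of df2 are not returned.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21']),
}).set_index('index')

df2 = pd.DataFrame({
    'index': ['b1', 'b2', 'b3', 'b4'],
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21']),
}).set_index('index')

# merge_asof requires both frames to be sorted by the "on" column.
left = df1.reset_index().sort_values('date')
right = (df2.reset_index()
            .rename(columns={'index': 'match_index'})
            .assign(date_match=lambda d: d['date'])  # keep the matched date
            .sort_values('date'))

# Exact match on the key columns, nearest match (either direction) on date.
result = pd.merge_asof(left, right, on='date', by=keys, direction='nearest')

# Signed difference in days; NaN where no row with the same keys exists.
result['diff'] = (result['date'] - result['date_match']).dt.days
result = result.set_index('index').sort_index()
print(result[keys + ['date', 'diff', 'match_index']])
```

Dropping the rows where match_index is missing gives the inner-join result, and passing a tolerance (for example tolerance=pd.Timedelta('30D')) limits how far apart two dates may be and still count as a match.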