Merging pandas dataframes based on nearest value(s)

Question

I have two dataframes, say A and B, that have some columns named attr1, attr2, ..., attrN.

I have a certain distance measure, and I would like to merge the dataframes such that each row in A is merged with the row in B that has the shortest distance between attributes. Note that rows in B can be repeated when merging.

For example (with one attribute to keep things simple), merging these two tables using the absolute difference distance |A.attr1 - B.attr1|:

A | attr1      B | attr1
0 | 10         0 | 15
1 | 20         1 | 27
2 | 30         2 | 80

should produce the following merged table:

M | attr1_A  attr1_B
0 | 10       15
1 | 20       15
2 | 30       27

My current approach is slow because it compares each row of A with each row of B. The code is also unclear because I have to preserve indices for merging. I am not satisfied with it at all, but I cannot come up with a better solution.

How can I perform the merge as above using pandas? Are there any convenience methods or functions that can be helpful here?

Just to clarify, the dataframes also contain other columns which are not used in the distance calculation, but which have to be merged as well.

Recommended answer

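One way to do it is as follows (note that B here contains the value 15 twice, unlike the table in the question):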

import pandas as pd

A = pd.DataFrame({'attr1':[10,20,30]})
B = pd.DataFrame({'attr1':[15,15,27]})

Use a cross join to get all combinations:

Update: with pandas 1.2+, use how='cross':

merged_AB = A.merge(B, how='cross', suffixes=('_A', '_B'))

Older pandas versions need a pseudo key instead:

A = A.assign(key=1)
B = B.assign(key=1)

merged_AB =pd.merge(A,B, on='key',suffixes=('_A','_B'))
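
As a rough check of what the cross join produces, the sketch below rebuilds it with the pseudo-key variant and counts the rows. The names `cross` and `n_pairs` are only illustrative. Because every row of A is paired with every row of B, the intermediate table has len(A) * len(B) rows, which is worth keeping in mind for large inputs.

```python
import pandas as pd

A = pd.DataFrame({'attr1': [10, 20, 30]})
B = pd.DataFrame({'attr1': [15, 15, 27]})

# Pseudo-key variant: a constant column joins every row of A to every row of B.
cross = (
    A.assign(key=1)
     .merge(B.assign(key=1), on='key', suffixes=('_A', '_B'))
     .drop(columns='key')
)

# One row per (A row, B row) pair.
n_pairs = len(cross)
print(n_pairs)  # 9
```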

Now let's find the minimum distances in merged_AB:

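# Flag the rows whose distance equals the minimum for their attr1_A value
dist = (merged_AB['attr1_A'] - merged_AB['attr1_B']).abs()
M = dist == dist.groupby(merged_AB['attr1_A']).transform('min')

# 'key' exists only in the pseudo-key variant, so ignore it when absent
merged_AB[M].drop_duplicates().drop(columns='key', errors='ignore')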

Output:

   attr1_A  attr1_B
0       10       15
3       20       15
8       30       27
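
For the single-attribute case, a possible alternative that avoids the full cross join is `pd.merge_asof` with `direction='nearest'`. The sketch below uses the tables from the question. The columns `name_A` and `name_B` are hypothetical, added only to show that columns not used in the distance are carried along. This is not the answer's method, and it only applies when the distance is the absolute difference on one sortable key, not a general distance over several attributes.

```python
import pandas as pd

# name_A / name_B are made-up extra columns for illustration.
A = pd.DataFrame({'attr1': [10, 20, 30], 'name_A': ['a0', 'a1', 'a2']})
B = pd.DataFrame({'attr1': [15, 27, 80], 'name_B': ['b0', 'b1', 'b2']})

# merge_asof requires both frames to be sorted by the join key.
left = A.sort_values('attr1')
right = B.sort_values('attr1').rename(columns={'attr1': 'attr1_B'})

# For each row of A, take the row of B whose key is closest in value.
nearest = pd.merge_asof(
    left, right,
    left_on='attr1', right_on='attr1_B',
    direction='nearest',
).rename(columns={'attr1': 'attr1_A'})

print(nearest)
```

Rows of B can be matched more than once, as the question requires, and the work is done on sorted keys rather than on all len(A) * len(B) pairs.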
