Function for matching values in multiple columns
Problem description
Using the following test data:
import numpy as np
import pandas as pd
df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df2['matches'] = np.where(df2.A - df2.B < thresh, 1, 0)
I created the df2['matches'] column, which shows a value of 1 when df2.A - df2.B < thresh:
A B C matches
0 0.501554 -0.589855 -0.751568 0
1 -0.295198 0.512442 0.466915 1
2 0.074863 0.343388 -1.700998 1
3 0.115432 -0.507847 -0.825545 0
4 1.013837 -0.007333 -0.292192 0
5 -0.930738 1.235501 -0.652071 1
6 -1.026615 1.389294 0.035041 1
7 0.969147 -0.397276 1.272235 0
8 0.120461 -0.634686 -1.123046 0
9 0.956896 -0.345948 -0.620748 0
10 -0.552476 1.376459 0.447807 1
11 0.882275 0.490049 0.713033 0
However, I would actually like to compare all three columns and have df2['matches'] return the number of matches, that is, how many of the values are within thresh of each other.
So, for example, if column A = 1, B = 2 and C = 1.5 and thresh was .5, the function would return 3 in the ['matches'] column.
Is there a function that already does something similar, or can anyone help with this?
Recommended answer
You can apply the threshold to each pair of columns, then sum the resulting boolean columns to obtain the number you need. Note, however, that with a signed difference this number depends on the order in which you compare the columns. The ambiguity disappears if you use abs(df['A']-df['B']) and so on, which may well be what you intend. Below I assume this is what you need.
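To make the order dependence concrete, here is a minimal sketch on a hypothetical two-row frame (the values are invented for illustration, not taken from the question). It compares the signed test in both directions with the absolute-value test:

```python
import pandas as pd

# Hypothetical data, chosen so the signed and absolute tests disagree.
df = pd.DataFrame({'A': [1.0, 0.1], 'B': [2.0, 0.2]})
thresh = 0.3

# Signed differences: the result depends on which column is subtracted.
signed_ab = (df['A'] - df['B']) < thresh   # A - B is negative in both rows
signed_ba = (df['B'] - df['A']) < thresh   # B - A is 1.0 in the first row

# Absolute difference: the same result whichever column comes first.
absolute = (df['A'] - df['B']).abs() < thresh

print(signed_ab.tolist(), signed_ba.tolist(), absolute.tolist())
```

In the first row the two columns are a full unit apart, yet the signed test A - B < thresh still reports a match because the difference is negative.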
In general, you can use itertools.combinations to produce each pair of columns exactly once:
from itertools import combinations
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df['matches'] = sum(abs(df[k1]-df[k2])<thresh for k1,k2 in combinations(df.keys(),2))
The generator expression inside sum() loops over every column pair and constructs the corresponding boolean vector. These vectors are added together row by row, and the resulting column is appended to the dataframe.
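The same computation can be written as an explicit loop, which may make the summing step easier to follow. This is a sketch on hypothetical data (picked so the counts can be checked by hand), with the column list `cols` spelled out as an assumption rather than taken from df.keys():

```python
from itertools import combinations
import pandas as pd

# Hypothetical data, chosen so the expected counts are easy to verify.
df = pd.DataFrame({
    'A': [1.0, 1.0, 0.0],
    'B': [1.1, 1.1, 5.0],
    'C': [1.2, 3.0, 9.0],
})
thresh = 0.3
cols = ['A', 'B', 'C']

# Start from a column of zeros and add one boolean vector per column pair.
matches = pd.Series(0, index=df.index)
for k1, k2 in combinations(cols, 2):
    close = (df[k1] - df[k2]).abs() < thresh  # boolean Series for this pair
    matches = matches + close.astype(int)     # each True counts as 1

df['matches'] = matches
print(df)
```

With three columns there are three pairs, so each row's count lies between 0 and 3. Here the first row has all three pairs within thresh, the second only the (A, B) pair, and the third none.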
Example output for thresh = 0.3:
A B C matches
0 0.146360 -0.099707 0.633632 1
1 1.462810 -0.186317 -1.411988 0
2 0.358827 -0.758619 0.038329 0
3 0.077122 -0.213856 -0.619768 1
4 0.215555 1.930888 -0.488517 0
5 -0.946557 -0.904743 -0.004738 1
6 -0.080209 -0.850830 -0.866865 1
7 -0.997710 -0.580679 -2.231168 0
8 1.762313 -0.356464 -1.813028 0
9 1.151338 0.347636 -1.323791 0
10 0.248432 1.265484 0.048484 1
11 0.559934 -0.401059 0.863616 0
With itertools.combinations, the columns (before the matches column is added) are paired in this order:
>>> import itertools
>>> list(itertools.combinations(df.keys(), 2))
[('A', 'B'), ('A', 'C'), ('B', 'C')]
The order within each pair does not matter, however, when you use the absolute value, since the difference is then symmetric with respect to the columns.
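One practical caveat, offered as a sketch rather than as part of the original answer: once the matches column exists, df.keys() includes it, so rerunning the one-liner would also compare the data columns against the count. A small helper that takes the columns explicitly avoids this. The name `count_close_pairs` and its signature are hypothetical, and the seeded generator is only there to make the example reproducible:

```python
from itertools import combinations
import numpy as np
import pandas as pd

def count_close_pairs(df, cols, thresh):
    """Per row, count the column pairs whose absolute difference is below thresh."""
    # Assumes at least two column names in cols; otherwise sum() returns plain 0.
    return sum((df[k1] - df[k2]).abs() < thresh for k1, k2 in combinations(cols, 2))

rng = np.random.default_rng(0)  # seeded only for reproducibility
df = pd.DataFrame(rng.standard_normal((12, 3)), columns=['A', 'B', 'C'])

df['matches'] = count_close_pairs(df, ['A', 'B', 'C'], thresh=0.3)
# Safe to rerun: 'matches' is never among the compared columns.
df['matches'] = count_close_pairs(df, ['A', 'B', 'C'], thresh=0.3)
print(df)
```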