Function for matching values in multiple columns

Question

Using the following test data:

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df2['matches'] = np.where(df2.A - df2.B < thresh, 1, 0)

I created the df2['matches'] column, which holds 1 wherever df2.A - df2.B < thresh and 0 otherwise.

        A           B            C      matches
0   0.501554    -0.589855   -0.751568   0
1   -0.295198   0.512442    0.466915    1
2   0.074863    0.343388    -1.700998   1
3   0.115432    -0.507847   -0.825545   0
4   1.013837    -0.007333   -0.292192   0
5   -0.930738   1.235501    -0.652071   1
6   -1.026615   1.389294    0.035041    1
7   0.969147    -0.397276   1.272235    0
8   0.120461    -0.634686   -1.123046   0
9   0.956896    -0.345948   -0.620748   0
10  -0.552476   1.376459    0.447807    1
11  0.882275    0.490049    0.713033    0

However, I actually want to compare all three columns: if the values are within thresh of each other, df2['matches'] should hold a number corresponding to the number of matches.

For example, if A = 1, B = 2, C = 1.5 and thresh is .5, the function would return 3 in the ['matches'] column.

Is there an existing function that does something like this, or can anyone help?

Recommended answer

You can apply the threshold to each pair of columns, then sum the resulting boolean columns to get the number you need. Note, however, that with a signed difference such as df['A']-df['B'] this number depends on the order in which you compare the columns. The ambiguity disappears if you use abs(df['A']-df['B']) etc., which may well be what you intended. Below I'll assume this is what you need.
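To make the order dependence concrete, here is a minimal sketch on a hypothetical two-row frame (the values are chosen for illustration and are not from the question). It compares the signed test in both directions with the absolute-value test.

```python
import pandas as pd

# Hypothetical data, chosen so that the signed difference is clearly asymmetric.
df = pd.DataFrame({"A": [0.0, 1.0], "B": [1.0, 0.0]})
thresh = 0.3

# Signed differences: swapping the columns changes the result.
a_minus_b = (df["A"] - df["B"] < thresh).astype(int).tolist()
b_minus_a = (df["B"] - df["A"] < thresh).astype(int).tolist()

# Absolute differences: the result is the same in either direction.
abs_ab = ((df["A"] - df["B"]).abs() < thresh).astype(int).tolist()
abs_ba = ((df["B"] - df["A"]).abs() < thresh).astype(int).tolist()

print(a_minus_b, b_minus_a)
print(abs_ab, abs_ba)
```

With the signed test, a large negative difference still counts as a "match", which is probably not what "within thresh" is meant to express.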

In general, you can use itertools.combinations to produce each pair of columns exactly once:

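from itertools import combinations
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
# Sum the boolean Series |col1 - col2| < thresh over every unordered column pair
df['matches'] = sum(abs(df[k1] - df[k2]) < thresh for k1, k2 in combinations(df.keys(), 2))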

The generator expression inside sum() loops over every column pair and builds the corresponding boolean vector. These vectors are added up across the column pairs, and the resulting column is appended to the dataframe.
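The same idea can be wrapped in a small helper, sketched below. The function name count_close_pairs and the hand-picked data are assumptions made for illustration; the values are chosen so that the counts can be checked by eye instead of relying on random numbers.

```python
from itertools import combinations

import pandas as pd


def count_close_pairs(frame, columns, thresh):
    """Count, per row, the column pairs whose absolute difference is below thresh."""
    return sum(
        ((frame[c1] - frame[c2]).abs() < thresh).astype(int)
        for c1, c2 in combinations(columns, 2)
    )


# Hypothetical values: row 0 has all three pairs close, row 1 only (A, B), row 2 none.
df = pd.DataFrame({
    "A": [1.0, 1.0, 0.0],
    "B": [1.1, 1.2, 5.0],
    "C": [1.2, 3.0, 9.0],
})
df["matches"] = count_close_pairs(df, ["A", "B", "C"], thresh=0.3)
print(df["matches"].tolist())
```

Passing the column names explicitly, instead of using df.keys(), keeps an already existing 'matches' column out of the comparison if the assignment is run a second time.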

Example output for thresh = 0.3:

           A         B         C  matches
0   0.146360 -0.099707  0.633632        1
1   1.462810 -0.186317 -1.411988        0
2   0.358827 -0.758619  0.038329        0
3   0.077122 -0.213856 -0.619768        1
4   0.215555  1.930888 -0.488517        0
5  -0.946557 -0.904743 -0.004738        1
6  -0.080209 -0.850830 -0.866865        1
7  -0.997710 -0.580679 -2.231168        0
8   1.762313 -0.356464 -1.813028        0
9   1.151338  0.347636 -1.323791        0
10  0.248432  1.265484  0.048484        1
11  0.559934 -0.401059  0.863616        0

With itertools.combinations, the columns are compared in this order:

>>> list(combinations(df.keys(), 2))
[('A', 'B'), ('A', 'C'), ('B', 'C')]

This order does not matter, however, when you use the absolute value, because the difference is then symmetric with respect to the columns.
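As a side note on why combinations is the right tool here, the sketch below contrasts it with itertools.permutations, which yields every pair in both orders. This comparison is an illustration added here, not part of the original answer.

```python
from itertools import combinations, permutations

columns = ["A", "B", "C"]

# Each unordered pair appears once.
unordered = list(combinations(columns, 2))

# Each pair appears twice, once per order.
ordered = list(permutations(columns, 2))

print(unordered)
print(ordered)
```

Since the absolute difference is symmetric, summing over permutations would simply count every match twice.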
