Pandas: Apply function over each pair of columns under constraints

Question

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'll try to illustrate this. My df has the form:

Code |  14  |  17  |  19  | ...
w1   |  0   |   5  |   3  | ...
w2   |  2   |   5  |   4  | ... 
w3   |  0   |   0  |   5  | ...

The Code corresponds to a specific location in a rectangular grid, and the ws are different words. I would like to apply a cosine similarity measure between each pair of columns, but only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.

The desired output would look like this:

     | [14,17]  |  [14,19]  |  [14,...]  |  [17,19]  | ...
Sim  |cs(14,17) |cs(14,19)  |cs(14,...)  |cs(17,19)..| ...

cs is the result of the cosine similarity for each pair of columns. Is there a suitable method to do this?

Any help would be appreciated :-)

Recommended answer

To apply the cosine metric to each pair from two collections of inputs, you could use scipy.spatial.distance.cdist. This will be much faster than using a double Python loop.

Let one collection be all the columns of df. Let the other collection be only those columns whose sum is greater than 5:

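import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
# Boolean mask: True for the columns whose sum exceeds 5
mask = df.sum(axis=0) > 5
# Second collection: only the columns that pass the mask
df2 = df.loc[:, mask]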

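Then all the pairwise values can be computed with one call to cdist. Note that metric='cosine' returns the cosine distance, which is 1 minus the cosine similarity, so subtract the values from 1 if you need the similarity itself: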

import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
#        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
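
As a quick check of what cdist returns here, the following sketch compares it with the cosine similarity computed directly from its definition for columns '17' and '14' of the example. The variable names u, v, similarity and distance are only illustrative and are not part of the original answer.

```python
import numpy as np
import scipy.spatial.distance as SSD

# Columns '17' and '14' of the example frame, as plain vectors.
u = np.array([5.0, 5.0, 0.0])
v = np.array([0.0, 2.0, 0.0])

# Cosine similarity from its definition: dot product over the product of norms.
similarity = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# cdist expects 2-D inputs with one observation per row.
distance = SSD.cdist(u[None, :], v[None, :], metric='cosine')[0, 0]

print(similarity)  # approximately 0.7071
print(distance)    # approximately 0.2929, i.e. 1 - similarity
```

The second value matches the first entry of the array above, which suggests that the numbers in the answer are distances rather than similarities.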

The values can be wrapped in a new DataFrame and reshaped:

result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()

Putting it all together, and dropping the pairs in which a column is compared with itself:


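import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
# Keep only the columns whose sum exceeds 5
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
# Cosine distance between every column of df2 and every column of df
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
# Reshape into a Series indexed by (column of df2, column of df)
result = result.stack()
# Drop the pairs that compare a column with itself
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)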

which yields the Series:

17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
