Most efficient way to construct similarity matrix
Question
I'm using the following links to create a "Euclidean Similarity Matrix" (that I convert to a DataFrame). https://stats.stackexchange.com/questions/53068/euclidean-distance-score-and-similarity http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.euclidean.html
The way I'm doing it is an iterative approach that works, but it takes a while when the datasets are big. The pandas pd.DataFrame.corr() is really fast and useful for Pearson correlations.
How can I compute a Euclidean similarity measure without exhaustive iteration?
My naive code is below:
# Euclidean similarity
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Create DataFrame (rows are samples s1..s4, columns are g1..g3)
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]
#      g1   g2    g3
# s1  1.2  3.4  10.2
# s2  1.4  3.1  10.7
# s3  2.1  3.7  11.3
# s4  1.5  3.2  10.9

# Create empty matrix to fill
M_euclid = np.zeros((DF_var.shape[1], DF_var.shape[1]))

# Iterate through DataFrame columns to measure Euclidean distance
for i in range(DF_var.shape[1]):
    u = DF_var[DF_var.columns[i]]
    for j in range(DF_var.shape[1]):
        v = DF_var[DF_var.columns[j]]
        # Euclidean distance -> Euclidean similarity
        M_euclid[i, j] = 1 / (1 + distance.euclidean(u, v))

DF_euclid = pd.DataFrame(M_euclid, columns=DF_var.columns, index=DF_var.columns)
#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000
Recommended answer
There are two useful functions within scipy.spatial.distance that you can use for this: pdist and squareform. Using pdist will give you the pairwise distances between observations (the rows of the input) as a one-dimensional array, and squareform will convert this to a distance matrix.
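To make the condensed-versus-square distinction concrete, here is a minimal sketch on a toy array. The array X and its values are made up for illustration and are not part of the question's data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three observations (rows) in two dimensions; values chosen for illustration.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Condensed form: one entry per unordered pair, in the order (0,1), (0,2), (1,2).
condensed = pdist(X, metric="euclidean")

# Square form: symmetric 3x3 matrix with zeros on the diagonal.
square = squareform(condensed)

print(condensed)  # distances 5, 10, 5
print(square)
```

Note that squareform always puts zeros on the diagonal, which is correct for distances but not for similarities.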
One catch is that pdist uses distance measures by default, and not similarity, so you'll need to manually specify your similarity function. Judging by the commented output in your code, your DataFrame is also not in the orientation pdist expects, so I've undone the transpose you did in your code.
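import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean, pdist, squareform

def similarity_func(u, v):
    # Map Euclidean distance to a similarity in (0, 1]
    return 1 / (1 + euclidean(u, v))

# Rows are the items to compare (g1..g3); columns are the samples (s1..s4)
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]})
DF_var.index = ["g1","g2","g3"]

dists = pdist(DF_var, similarity_func)
sim = squareform(dists)
# squareform leaves zeros on the diagonal; self-similarity is 1/(1+0) = 1
np.fill_diagonal(sim, 1.0)
DF_euclid = pd.DataFrame(sim, columns=DF_var.index, index=DF_var.index)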
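Passing a Python callable to pdist still invokes that function once per pair, so on large inputs it may not be much faster than the explicit loop. A possible alternative, sketched here as a suggestion rather than as part of the original answer, is to let pdist compute plain Euclidean distances with its built-in metric and then convert the whole matrix to similarities in one vectorized step. The name DF_euclid_fast is introduced only for this example.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same data as above: rows g1..g3 are compared across samples s1..s4.
DF_var = pd.DataFrame(
    {"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
     "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]},
    index=["g1", "g2", "g3"],
)

# Built-in metric: the pairwise loop runs in compiled code, not in Python.
dist = squareform(pdist(DF_var.to_numpy(), metric="euclidean"))

# Element-wise distance -> similarity; the zero diagonal becomes 1 automatically.
sim = 1.0 / (1.0 + dist)

DF_euclid_fast = pd.DataFrame(sim, index=DF_var.index, columns=DF_var.index)
print(DF_euclid_fast.round(6))
```

Because the transform is applied after squareform, the diagonal comes out as 1 without any extra step, matching the output of the loop-based code in the question.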