Use similarity matrix instead of dissimilarity matrix for MDS in scikit-learn
Problem description
I want to visualize the similarity of text documents, for which I am using scikit-learn's TfidfVectorizer as tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data)
and then computing cosine similarity as cosine_similarity = (tfidf*tfidf.T).toarray()
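For reference, a minimal runnable version of those two lines might look like the sketch below; the data corpus is a placeholder standing in for the question's data. Note that TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the product of the matrix with its transpose is indeed the cosine-similarity matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus standing in for `data` from the question.
data = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "matrices and vectors in python",
]

# TfidfVectorizer L2-normalizes each row by default (norm='l2'),
# so the dot product of two rows is their cosine similarity.
tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data)

# Equivalent to the question's (tfidf * tfidf.T).toarray();
# @ is the sparse matrix product.
cosine_similarity = (tfidf @ tfidf.T).toarray()

# Self-similarities sit on the diagonal, close to (but not exactly)
# 1.0 due to floating-point rounding -- the 1.12e-9 effect below.
```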
This gives a similarity matrix, but sklearn.manifold.MDS
needs a dissimilarity matrix. When I compute 1 - cosine_similarity, the diagonal values, which should be zero, are not zero; they are small values like 1.12e-9
etc. Two questions:
1) How do I use a similarity matrix with MDS, or how do I convert my similarity matrix into a dissimilarity matrix?
2) In MDS, there is an option dissimilarity
, the value of which can be 'precomputed'
or 'euclidean'
. What is the difference between the two? When I pass 'euclidean', the MDS coordinates come out the same regardless of whether I use cosine_similarity or 1 - cosine_similarity, which looks wrong.
Thanks!
Recommended answer
I do not really understand your cosine transformation (as I see no cosine/angle/normalized scalar product being involved), and I do not know the TfidfVectorizer functionality, but I will try to answer your two questions:
1) Generally, the (dissimilarity = 1 - similarity) approach is valid for cases in which all entries in the matrix are between -1 and 1. Assuming d = cosine_similarity is such a symmetric similarity matrix up to numerical artefacts, you can apply
dissimilarity_clean = np.triu(1 - d, 1) + np.triu(1 - d, 1).T
to correct for the artefacts. The same operation can be needed when using numpy's corrcoef(X) to create a dissimilarity matrix based on Pearson correlation coefficients. Two side notes: 1. For non-bounded similarity measures you can still come up with equivalent approaches. 2. When using MDS you might consider a measure that is closer to euclidean distance (and not bounded), as this would be a more natural choice for MDS and lead to better results.
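As a concrete sketch, here is that clean-up applied to a toy similarity matrix carrying the kind of artefacts described above (the matrix d is made up for illustration):

```python
import numpy as np

# Toy similarity matrix with small numerical artefacts: the diagonal is
# slightly off 1.0 and the off-diagonal entries are slightly asymmetric.
d = np.array([
    [1.0 + 1e-9,   0.80,         0.30],
    [0.80 - 1e-10, 1.0 - 1e-9,   0.50],
    [0.30,         0.50 + 1e-10, 1.0],
])

# Build the dissimilarity from the upper triangle only and mirror it;
# np.triu(..., k=1) excludes the diagonal, so it stays exactly zero,
# and mirroring guarantees exact symmetry.
upper = np.triu(1.0 - d, k=1)
dissimilarity_clean = upper + upper.T
```

The result has an exactly zero diagonal and is exactly symmetric, which is what MDS with dissimilarity='precomputed' expects.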
2) Using the 'precomputed' option assumes that you feed the .fit(X=dissimilarity_matrix) method of MDS with a dissimilarity matrix that you precomputed (your scenario). Using dissimilarity = 'euclidean' instead would compute the euclidean distance matrix of the data that you pass to .fit(X=data).
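A minimal sketch contrasting the two modes (the matrices below are made-up examples, not the question's data):

```python
import numpy as np
from sklearn.manifold import MDS

# 'precomputed': fit_transform receives a ready-made dissimilarity matrix
# (symmetric, zero diagonal), e.g. a cleaned-up 1 - cosine_similarity.
D = np.array([
    [0.0, 0.5, 0.9],
    [0.5, 0.0, 0.4],
    [0.9, 0.4, 0.0],
])
mds_pre = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords_pre = mds_pre.fit_transform(D)

# 'euclidean' (the default): fit_transform receives raw feature vectors,
# and MDS computes the euclidean distance matrix from them internally.
# Passing a similarity matrix S here treats each *row* as a data point;
# since ||(1 - s_i) - (1 - s_j)|| == ||s_j - s_i||, the rows of S and of
# 1 - S have identical pairwise euclidean distances, which is why the
# question saw the same coordinates for both inputs.
X = np.random.RandomState(0).rand(3, 5)
mds_euc = MDS(n_components=2, dissimilarity='euclidean', random_state=0)
coords_euc = mds_euc.fit_transform(X)
```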
Hope this helps!