Use similarity matrix instead of dissimilarity matrix for MDS in scikit-learn


Problem Description

I want to visualize the similarity of text documents, for which I am using scikit-learn's TfidfVectorizer as tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data)

and then computing the cosine similarity as cosine_similarity = (tfidf*tfidf.T).toarray()

which gives a similarity matrix, but sklearn.manifold.MDS needs a dissimilarity matrix. When I use 1-cosine_similarity, the diagonal values, which should be zero, are not zero; they are small values like 1.12e-9 (see the sketch after the questions below). Two questions:

1) How do I use a similarity matrix for MDS, or how do I change my similarity matrix into a dissimilarity matrix?

2) In MDS, there is an option dissimilarity, whose value can be 'precomputed' or 'euclidean'. What's the difference between the two? When I give 'euclidean', the MDS coordinates come out the same regardless of whether I use cosine_similarity or 1-cosine_similarity, which looks wrong.
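
For reference, here is a minimal sketch of the setup described above, assuming data is a small list of raw text strings (the toy corpus is hypothetical; decode_error='ignore' and max_df=3 are taken from the question). Because TfidfVectorizer L2-normalizes its rows by default, (tfidf*tfidf.T).toarray() is indeed a cosine similarity matrix, and its diagonal can deviate from 1 by tiny floating-point errors:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus standing in for data
data = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

# rows of the tf-idf matrix are L2-normalized by default (norm='l2'),
# so the matrix product with its transpose gives cosine similarities
tfidf = TfidfVectorizer(decode_error='ignore', max_df=3).fit_transform(data)
cosine_similarity = (tfidf * tfidf.T).toarray()

# the diagonal of 1 - similarity should be exactly 0 but may show tiny numerical noise
print(np.diag(1 - cosine_similarity))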

Thanks!

Recommended Answer

I do not really understand your cosine transformation (as I see no cosine/angle/normalized scalar product being involved), and I do not know the TfidfVectorizer functionality, but I will try to answer your two questions:

1) Generally, the (dissimilarity = 1 - similarity) approach is valid for cases in which all entries of the matrix are between -1 and 1. Assuming the matrix d = cosine_similarity is such a symmetric similarity matrix up to numerical artefacts, you can apply

dissimilarity_clean = 1 - (np.triu(d, 1) + np.triu(d, 1).T + np.eye(len(d)))

to correct for the artefacts. The same operation may be needed when using numpy's corrcoef(X) to create a dissimilarity matrix based on Pearson correlation coefficients. Two side notes: 1. For non-bounded similarity measures you can still come up with equivalent approaches. 2. If you are using MDS, you might consider a measure that is closer to Euclidean distance (and not bounded), as this would be a more natural choice for MDS and lead to better results.
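
As a concrete illustration, here is an equivalent way (a sketch, assuming cosine_similarity from the question is available as a dense array) to obtain such a cleaned dissimilarity_clean and to verify that it is symmetric with an exactly zero diagonal:

import numpy as np

d = cosine_similarity                                        # similarity matrix from the question
dissimilarity_clean = 1.0 - d
np.fill_diagonal(dissimilarity_clean, 0.0)                   # self-dissimilarity is zero by definition
dissimilarity_clean = np.clip(dissimilarity_clean, 0.0, 2.0)                # drop tiny negative round-off
dissimilarity_clean = (dissimilarity_clean + dissimilarity_clean.T) / 2.0   # enforce exact symmetry

assert np.allclose(dissimilarity_clean, dissimilarity_clean.T)
assert np.all(np.diag(dissimilarity_clean) == 0.0)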

2) Using the 'precomputed' option assumes that you feed the .fit() method of MDS with a dissimilarity matrix that you precomputed (your scenario). Using dissimilarity = 'euclidean' instead would compute the Euclidean distance matrix of the data that you pass to .fit(X=data). This also explains your observation: the Euclidean distances between the rows of cosine_similarity and the rows of 1-cosine_similarity are identical (1 - S only shifts and negates each row), so MDS with 'euclidean' produces the same coordinates for both.
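
Putting the pieces together, a minimal sketch of the 'precomputed' route, reusing dissimilarity_clean from above (n_components=2 and random_state=0 are illustrative choices, not part of the question):

from sklearn.manifold import MDS

# 'precomputed': fit_transform receives the dissimilarity matrix itself
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(dissimilarity_clean)

# with dissimilarity='euclidean' (the default), fit_transform would instead receive
# a plain data matrix, and MDS would compute Euclidean row distances internally
print(coords.shape)   # (number of documents, 2)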

Hope this helps!
