Cosine distance as vector distance function for k-means


Problem description

I have a graph of N vertices where each vertex represents a place. Also I have vectors, one per user, each one of N coefficients where the coefficient's value is the duration in seconds spent at the corresponding place or 0 if that place was not visited.

For example, for the graph (original figure omitted), the vector:

v1 = {100, 50, 0, 30, 0}

This means the user spent:

100 secs at vertex 1,
50 secs at vertex 2, and
30 secs at vertex 4

(vertices 3 & 5 were not visited, hence the 0s).

I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine_similarity is:

cosine_similarity(a, b) = sum(a .* b) / (length(a) * length(b))

But I noticed the following. Assume k=2 and one of the vectors is:

v1 = {90,0,0,0,0}

In the process of solving the optimization problem of minimizing the total distance from candidate centroids, assume that at some point, 2 candidate centroids are:

c1 = {90,90,90,90,90}
c2 = {1000, 1000, 1000, 1000, 1000}

Running the cosine_distance formula for (v1, c1) and (v1, c2) we get exactly the same distance of 0.5527864045 for both.
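
This is easy to verify with a minimal Python sketch (`cosine_distance` is just the `1 - cosine_similarity` definition from the question, written out by hand):

```python
import math

def cosine_distance(a, b):
    # cosine_distance = 1 - dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

v1 = [90, 0, 0, 0, 0]
c1 = [90, 90, 90, 90, 90]
c2 = [1000, 1000, 1000, 1000, 1000]

print(cosine_distance(v1, c1))  # ≈ 0.5527864045
print(cosine_distance(v1, c2))  # same value: cosine ignores magnitude
```

Both distances come out identical because c2 is just c1 scaled by a constant, and cosine only measures the angle between vectors, never their length.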

I would assume that v1 is more similar (closer) to c1 than c2. Apparently this is not the case.

Q1. Why is this assumption wrong?

Q2. Is the cosine distance a correct distance function for this case?

Q3. What would be a better one given the nature of the problem?

Answer

Let's divide cosine similarity into parts and see how and why it works.

Cosine between 2 vectors - a and b - is defined as:

cos(a, b) = sum(a .* b) / (length(a) * length(b))

where .* is element-wise multiplication. The denominator is there just for normalization, so let's simply call it L. With it, our function turns into:

cos(a, b) = sum(a .* b) / L

This can be rewritten as:

cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[n]*b[n]) / L = 
          = a[1]*b[1]/L + a[2]*b[2]/L + ... + a[n]*b[n]/L

Let's get a bit more abstract and replace x * y / L with a function g(x, y) (L is constant here, so we don't pass it as a function argument). Our cosine function thus becomes:

cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[n], b[n]) 
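
This decomposition is easy to check numerically (a minimal sketch using the two vectors from the question):

```python
import math

a = [100, 50, 0, 30, 0]  # v1 from the question
b = [90, 0, 0, 0, 0]
# L is the product of the two vector lengths, fixed for a given (a, b)
L = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))

def g(x, y):
    # per-pair contribution to the cosine similarity
    return x * y / L

# the sum of per-pair treatments equals the usual cosine similarity
cos_ab = sum(g(x, y) for x, y in zip(a, b))
print(cos_ab)  # same as dot(a, b) / (|a| * |b|)
```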

That is, each pair of elements (a[i], b[i]) is treated separately, and the result is simply the sum over all pairs. This is good for your case, because you don't want different pairs (different vertices) to interfere with each other: if user1 visited only vertex2 and user2 only vertex1, they have nothing in common, and the similarity between them should be zero. What you actually don't like is how the similarity between individual pairs - i.e. the function g() - is calculated.

With the cosine function, the similarity between individual pairs looks like this:

g(x, y) = x * y / L

where x and y represent the time users spent at the vertex. And here's the main question: does multiplication represent the similarity between individual pairs well? I don't think so. A user who spent 90 seconds at some vertex should be close to a user who spent, say, 70 or 110 seconds there, but much farther from users who spent 1000 or 0 seconds there. Multiplication (even normalized by L) is totally misleading here. What does it even mean to multiply two time periods?

The good news is that you are the one who designs the similarity function. We have already decided that we are happy with the independent treatment of pairs (vertices), and we only need the individual similarity function g(x, y) to do something reasonable with its arguments. And what is a reasonable function for comparing time periods? I'd say subtraction is a good candidate:

g(x, y) = abs(x - y)
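
Summed over all pairs, this g() is the Manhattan (L1) distance. Plugging in the centroids from the question gives the ordering one would intuitively expect (a minimal sketch):

```python
def l1_distance(a, b):
    # sum of per-pair absolute differences: g(x, y) = abs(x - y)
    return sum(abs(x - y) for x, y in zip(a, b))

v1 = [90, 0, 0, 0, 0]
c1 = [90, 90, 90, 90, 90]
c2 = [1000, 1000, 1000, 1000, 1000]

print(l1_distance(v1, c1))  # 360
print(l1_distance(v1, c2))  # 4910 -> v1 is closer to c1, as expected
```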

This is not a similarity function but a distance function - the closer the values are to each other, the smaller the result of g() - but ultimately the idea is the same, so we can interchange them when needed.

We may also want to increase the impact of large mismatches by squaring the difference:

g(x, y) = (x - y)^2 

Hey! We've just reinvented the (mean) squared error! We can now stick with MSE to calculate distances, or we can keep looking for a good g() function.
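
With this g(), v1 from the question does end up much closer to c1 than to c2 (a minimal sketch of the squared-error distance):

```python
def sq_error(a, b):
    # sum of per-pair squared differences: g(x, y) = (x - y)^2
    return sum((x - y) ** 2 for x, y in zip(a, b))

v1 = [90, 0, 0, 0, 0]
c1 = [90, 90, 90, 90, 90]
c2 = [1000, 1000, 1000, 1000, 1000]

print(sq_error(v1, c1))  # 32400
print(sq_error(v1, c2))  # 4828100 -> c1 is now clearly the closer centroid
```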

Sometimes we may want not to amplify but rather to flatten the difference. In this case we can use log:

g(x, y) = log(1 + abs(x - y))   # the "1 +" avoids log(0) when x == y

We can give zeros special treatment, like this:

g(x, y) = sign(x)*sign(y)*abs(x - y)   # sign(0) will turn whole expression to 0
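
A quick sketch of how the sign() trick zeroes out pairs where either user skipped the vertex (durations are non-negative, so sign is 0 or 1; Python's stdlib has no sign(), so it is written by hand here):

```python
def sign(x):
    # -1, 0, or 1; for non-negative durations this is effectively 0 or 1
    return (x > 0) - (x < 0)

def g(x, y):
    # any pair where either duration is zero contributes nothing
    return sign(x) * sign(y) * abs(x - y)

print(g(90, 70))  # 20: both users visited, so compare the durations
print(g(90, 0))   # 0: one user never visited this vertex
```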

Or we can go back from a distance to a similarity by inverting the difference:

g(x, y) = 1 / (1 + abs(x - y))   # the "1 +" avoids division by zero when x == y

Note that in the last few options we haven't used a normalization factor. In fact, you can come up with a good normalization for each case, or just omit it - normalization is not always needed or beneficial. For example, in the cosine similarity formula, if you change the normalization constant L = length(a) * length(b) to L = 1, you get different but still reasonable results, e.g.:

cos([90, 90, 90]) == cos([1000, 1000, 1000])                 # measuring angle only
cos_no_norm([90, 90, 90]) < cos_no_norm([1000, 1000, 1000])  # measuring both angle and magnitude
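
A sketch of the two variants side by side, comparing each vector with itself (`cos_no_norm` is just the plain dot product, i.e. L = 1):

```python
import math

def cos(a, b):
    # normalized: L = length(a) * length(b), so only the angle matters
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cos_no_norm(a, b):
    # same formula with L = 1: magnitude now matters too
    return sum(x * y for x, y in zip(a, b))

u = [90, 90, 90]
v = [1000, 1000, 1000]

print(cos(u, u), cos(v, v))                  # both 1.0: identical direction
print(cos_no_norm(u, u), cos_no_norm(v, v))  # 24300 vs 3000000: v is "bigger"
```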

Summarizing this long and mostly boring story, I would suggest rewriting the cosine similarity/distance to use some kind of difference between the variables of the two vectors.

