Apache Mahout + Euclidean Distance: Unexpected Results

Problem description

I'm using Mahout's EuclideanDistanceSimilarity class to rank the similarity of several users given the following data set of user preferences. The preferences are currently all integers from 1 to 5 inclusive. However, I control the scale, so it can change if that would help.

User    Preferences:
        Item 1    Item 2    Item 3    Item 4    Item 5    Item 6
 1       2         4         3         5         1         2
 2       5         1         5         1         5         1
 3       1         5         1         5         1         5
 4       2         4         3         5         1         2
 5       3         3         4         5         2         2

I'm getting unexpected results when I run the following test code, which I added to the test class found here: http://www.massapi.com/source/mahout-distribution-0.4/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/EuclideanDistanceSimilarityTest.java.html

@Test
public void testSimple2() throws Exception {
    // Rows are users 1..5; columns are preferences for items 1..6.
    DataModel dataModel = getDataModel(
            new long[]{1, 2, 3, 4, 5},
            new Double[][]{
                {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
                {5.0, 1.0, 5.0, 1.0, 5.0, 1.0},
                {1.0, 5.0, 1.0, 5.0, 1.0, 5.0},
                {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
                {3.0, 3.0, 4.0, 5.0, 2.0, 2.0},
            });
    // Build the similarity once rather than on every loop iteration.
    EuclideanDistanceSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
    for (int i = 1; i <= 5; i++) {
        for (int j = 1; j <= 5; j++) {
            System.out.println(i + "," + j + ": " + similarity.userSimilarity(i, j));
        }
    }
}

It produces the following results:

1,1: 1.0
1,2: 0.7129109430106292
1,3: 1.0
1,4: 1.0
1,5: 1.0
2,1: 0.7129109430106292
2,2: 1.0
2,3: 0.5556605665978556
2,4: 0.7129109430106292
2,5: 0.8675434911352263
3,1: 1.0
3,2: 0.5556605665978556
3,3: 1.0
3,4: 1.0
3,5: 0.9683428667784535
4,1: 1.0
4,2: 0.7129109430106292
4,3: 1.0
4,4: 1.0
4,5: 1.0
5,1: 1.0
5,2: 0.8675434911352263
5,3: 0.9683428667784535
5,4: 1.0
5,5: 1.0

Would someone please help me understand what I'm doing wrong here? Clearly, user 1's preferences are not identical to users 3 & 5, so why do I get 1.0 for the similarity?

I'm open to using a different algorithm if Euclidean won't work. However, Pearson doesn't work for me, because I need to handle users who submit identical preferences for every item, and I do not want to correct for "grade inflation."

Recommended answer

It is a little weird but I can explain what's happening.

The Euclidean distance d can't be used as a similarity metric directly since it gets bigger with "less similarity". You could use 1/d, but then perfect matches result in infinity, not 1. You can use 1/(1+d).
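
To make the 1/(1+d) mapping concrete, here is a minimal sketch applied to three of the user pairs from the question. It is plain Java, not Mahout code; the class name InverseDistanceSimilarity and its methods are made up for illustration, and it assumes both users have rated every item.

```java
class InverseDistanceSimilarity {

    // Euclidean distance between two equal-length preference vectors.
    static double distance(double[] a, double[] b) {
        double sumSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares);
    }

    // Maps a distance in [0, infinity) to a similarity in (0, 1].
    // Identical vectors have d = 0 and therefore similarity exactly 1.
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        // Preference rows copied from the question's data set.
        double[] user1 = {2, 4, 3, 5, 1, 2};
        double[] user3 = {1, 5, 1, 5, 1, 5};
        double[] user4 = {2, 4, 3, 5, 1, 2};
        double[] user5 = {3, 3, 4, 5, 2, 2};

        System.out.println("1,4: " + similarity(user1, user4)); // identical rows
        System.out.println("1,5: " + similarity(user1, user5)); // close, not identical
        System.out.println("1,3: " + similarity(user1, user3)); // further apart
    }
}
```

With this mapping only the identical pair (users 1 and 4) scores 1.0. Users 1 and 5 come out at about 0.33 and users 1 and 3 at roughly 0.21.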

The problem is that the distance can only be calculated over the dimensions that both users have in common. More dimensions typically means more distance. So 1/(1+d) penalizes overlap, which is the opposite of what you'd expect.

So the formula is really n/(1+d), where n is the number of dimensions of overlap. In some cases that yields a value greater than 1, which is then capped back to 1. That is where the 1.0 for non-identical users comes from.
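
The n/(1+d) formula with the cap at 1 can be checked against the output in the question. The sketch below is not Mahout's source; it only applies the formula as described here, assuming every user has rated all six items (so n = 6). The class name OverlapScaledSimilarity and the PREFS array are hypothetical names used for illustration.

```java
class OverlapScaledSimilarity {

    // Preference rows for users 1..5, copied from the question.
    static final double[][] PREFS = {
        {2, 4, 3, 5, 1, 2},
        {5, 1, 5, 1, 5, 1},
        {1, 5, 1, 5, 1, 5},
        {2, 4, 3, 5, 1, 2},
        {3, 3, 4, 5, 2, 2},
    };

    // n / (1 + d), capped at 1.0, where d is the Euclidean distance and
    // n is the number of co-rated items. Assumption: both users rated
    // every item, so n is simply the vector length.
    static double similarity(double[] a, double[] b) {
        double sumSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquares += diff * diff;
        }
        double d = Math.sqrt(sumSquares);
        double raw = a.length / (1.0 + d);
        return Math.min(1.0, raw);
    }

    public static void main(String[] args) {
        // Print the full user-by-user matrix in the same "i,j: value" layout.
        for (int i = 0; i < PREFS.length; i++) {
            for (int j = 0; j < PREFS.length; j++) {
                System.out.println((i + 1) + "," + (j + 1) + ": "
                        + similarity(PREFS[i], PREFS[j]));
            }
        }
    }
}
```

For users 1 and 3, d = sqrt(15) ≈ 3.87, so 6/(1+d) ≈ 1.23, which is capped to 1.0. For users 1 and 2, d = sqrt(55) ≈ 7.42, giving about 0.713, in line with the reported 0.7129.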

n is not the right factor. It's an old simple kludge. I will ask on the mailing list about the right-er expression. For large data, this tends to work OK though.
