How is NaN handled in a Pearson correlation user-user similarity matrix in a recommender system?


Question


I am generating a user-user similarity matrix from user rating data (specifically the MovieLens 100K data). Computing the correlations produces some NaN values. I have tested this on a smaller dataset:

User-Item rating matrix

   I1 I2 I3 I4
U1 4  0  5  5  
U2 4  2  1  0  
U3 3  0  2  4  
U4 4  4  0  0  

User-User Pearson Correlation similarity matrix

              U1        U2        U3       U4      U5
U1             1        -1         0      -nan  0.755929
U2            -1         1         1      -nan -0.327327
U3             0         1         1      -nan  0.654654
U4          -nan      -nan      -nan      -nan      -nan
U5      0.755929 -0.327327  0.654654      -nan         1

For computing the Pearson correlation, only co-rated items are considered between two users. (See "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", Gediminas Adomavicius and Alexander Tuzhilin.)

How can I handle the NaN values?

EDIT Here is the code with which I compute the Pearson correlation in R. The matrix R is the user-item rating matrix; it contains ratings on a 1 to 5 scale, where 0 means not rated. S is the user-user correlation matrix.

  for (i in 1:nrow (R))
  {
    cat ("user: ", i, "\n");
    for (k in 1:nrow (R))
    {
      if (i != k)
      {
        # items rated (non-zero) by both user i and user k
        corated_list <- which ((R[i,] != 0) & (R[k,] != 0));
        # mean-centre each user's ratings over the co-rated items only
        ui <- R[i,corated_list] - mean (R[i,corated_list]);
        uk <- R[k,corated_list] - mean (R[k,corated_list]);
        # Pearson correlation over the co-rated items
        temp <- sum (ui * uk) / sqrt (sum (ui^2) * sum (uk^2));
        S[i,k] <- ifelse (is.nan (temp), 0, temp)   # replace NaN with 0
      }
      else
      {
        S[i,k] <- 0;
      }
    }
  }

Note that in the line S[i,k] <- ifelse (is.nan (temp), 0, temp) I am replacing the NaNs with 0.

Solution

I recently developed a recommender system in Java for user-user and user-item matrices. Firstly, as you have probably already found, recommender systems are difficult. For my implementation I used the excellent Apache Commons Math library; you are using R, which is probably fairly similar in how it calculates Pearson's.

Your question was: how can I handle NaN values? It was followed by an edit saying that you are setting NaN = 0.

My answer is this:

You shouldn't really handle NaN values as 0, because replacing a NaN with 0 says that there is absolutely no correlation between those users (or that user and item). This might be the case, but it is not always the case, and ignoring the distinction will skew your recommendations.

Firstly, you should be asking yourself: "why am I getting NaN values?" Here are some reasons from the Wikipedia page on NaN detailing why you might get a NaN value:

There are three kinds of operations that can return NaN:

  1. Operations with a NaN as at least one operand.

  2. Indeterminate forms: the divisions 0/0 and ±∞/±∞; the multiplications 0 × ±∞ and ±∞ × 0; the additions ∞ + (−∞), (−∞) + ∞ and the equivalent subtractions. (The standard has alternative functions for powers: the standard pow function and the integer-exponent pown function define 0^0, 1^∞ and ∞^0 as 1, while the powr function treats all three indeterminate forms as invalid operations and so returns NaN.)

  3. Real operations with complex results, for example: the square root of a negative number; the logarithm of a negative number; the inverse sine or cosine of a number that is less than −1 or greater than +1.

You should debug your application and step through each stage of the calculation to see which of the above reasons is the offending cause.
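
To make those cases concrete, here is a small illustrative R snippet (not from the original post) showing how the relevant operations surface as NaN; these are the values to watch for while stepping through the loop:

  0 / 0            # NaN: the 0/0 indeterminate form (the usual culprit in correlation code)
  Inf - Inf        # NaN: an indeterminate addition/subtraction
  sqrt (-1)        # NaN (with a warning): a real operation with a complex result
  log (-1)         # NaN (with a warning): likewise
  NaN + 1          # NaN: any operation with NaN as an operand stays NaN
  is.nan (0 / 0)   # TRUE: the test the question's code already uses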

Secondly, understand that Pearson's correlation can be written in a number of different ways, so you need to consider whether you are calculating it across a sample or a population and then use the appropriate formula, e.g.:

cor(X, Y) = Σ[(x_i - E(X))(y_i - E(Y))] / [(n - 1) s(X) s(Y)]

where E(X) is the mean of X, E(Y) is the mean of Y, and s(X), s(Y) are the standard deviations. The standard deviation is generally the positive square root of the variance, with variance = sum((x_i - mean)^2) / (n - 1),

where mean is the sample mean and n is the number of observations.
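
As a sanity check, here is a minimal sketch (the example values are U1's and U3's co-rated items from the small dataset above) that writes this formula out directly and compares it with R's built-in cor():

  x <- c (4, 5, 5)                     # U1's ratings on the items co-rated with U3 (I1, I3, I4)
  y <- c (3, 2, 4)                     # U3's ratings on the same items
  n <- length (x)
  num   <- sum ((x - mean (x)) * (y - mean (y)))
  denom <- (n - 1) * sd (x) * sd (y)   # sd() uses the (n - 1) variance, matching the formula above
  num / denom                          # 0, matching the U1-U3 entry in the similarity matrix
  cor (x, y)                           # built-in check: also 0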

This is probably where your NaNs are appearing, i.e. dividing by 0 for the not-rated entries. If you can, I would suggest not using the value 0 to mean "not rated"; use null instead. I would do this for two reasons: 1. the 0s are probably what is messing up your results with NaNs, and 2. readability/understandability: your scale is 1 to 5, so 0 should not feature in it and only confuses things. So avoid that if possible.
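
A minimal sketch of that suggestion, assuming the same user-item matrix R as in the question (with 0 meaning not rated): convert the 0s to NA and let cor() handle the missing values pairwise.

  R[R == 0] <- NA                                   # make "not rated" explicit rather than a fake rating of 0
  S <- cor (t (R), use = "pairwise.complete.obs")   # user-user Pearson matrix over co-rated items only
  # Pairs with too few co-rated items (or constant ratings) come back as NA/NaN with a
  # warning, which is more honest than silently forcing them to 0.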

Thirdly, from a recommender standpoint, think about things from a recommendation point of view. If you have two users and they only have one rating in common, say U1 and U4 for I1 in your smaller dataset, is that one item in common really enough to offer recommendations on? The answer is, of course, not. So can I also suggest that you set a minimum threshold on the number of ratings in common to ensure that the quality of the recommendations is better. The minimum you can set for this threshold is 2, but consider setting it a bit higher; if you read the MovieLens research, they set it to somewhere between 5 and 10 (I can't remember off the top of my head). The higher you set it, the less coverage you will get, but you will achieve "better" (lower error score) recommendations. You have probably done your reading of the academic literature and picked up on this point already, but I thought I would mention it anyway.
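
One quick way to apply that idea (a sketch, again assuming the question's 0-means-unrated matrix R and its similarity matrix S) is to count the co-rated items for every pair of users and discard the pairs below the threshold:

  rated <- (R != 0)                # TRUE where a user has rated an item (use !is.na (R) if 0s were replaced by NA)
  overlap <- rated %*% t (rated)   # overlap[i,k] = number of items co-rated by users i and k
  min_corated <- 5                 # illustrative threshold, per the 5-10 suggestion above
  S[overlap < min_corated] <- NA   # don't trust Pearson's for these pairs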

On the above point, look at U4 and compare with every other user. Notice how little U4 has in common with anyone: a single co-rated item with U1 and with U3, and two with U2, all of which U4 rated 4. Now, hopefully, you will notice that the NaNs appear exclusively with U4. If you have followed this answer, you will see why you are getting NaNs: you cannot really compute Pearson's from a single co-rated item (or from a user whose co-rated ratings are all identical), because every deviation from the mean is zero and the formula ends up dividing 0 by 0 :)
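
Here is that failure mode in miniature, using the single item U1 and U4 share (I1, which both rated 4); the same division by zero happens whenever a user's co-rated ratings are all identical:

  ui <- 4 - mean (4)                                # U1's only co-rated rating, centred: 0
  uk <- 4 - mean (4)                                # U4's only co-rated rating, centred: 0
  sum (ui * uk) / sqrt (sum (ui^2) * sum (uk^2))    # 0 / 0 = NaN, exactly the term the question's code computes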

Finally, one thing that slightly bothers me about the sample dataset above is the number of correlations that are 1 or -1. Think about what that is actually saying about these users' preferences, then sense-check it against the actual ratings. E.g. look at the U1 and U2 ratings: for item 1 they agree strongly (both rated it a 4), then for item 3 they disagree strongly (U1 rated it 5, U2 rated it 1), so it seems strange that the Pearson correlation between these two users is -1 (i.e. their preferences are completely opposite). That is clearly not the case; really the Pearson score should be a bit above or a bit below 0. This issue links back to the points about using 0 on the scale and about comparing only a small number of items.

Now, there are strategies for "filling in" items that users have not rated. I am not going to go into them here, you will need to read up on that, but essentially they amount to things like using the average score for that item or the average rating given by that user. Both methods have their downsides, and personally I don't really like either of them. My advice is to only calculate Pearson correlations between users when they have 5 or more items in common, and to ignore the items whose ratings are 0 (or, better, null).
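
Putting that advice together, here is a sketch of the question's loop with the two changes suggested above: a minimum-overlap threshold (5 is just the illustrative value from this answer) and NA instead of 0 for pairs that cannot be scored.

  min_corated <- 5;                     # illustrative threshold
  S <- matrix (NA, nrow (R), nrow (R)); # NA, not 0, for pairs we cannot (or will not) score
  for (i in 1:nrow (R))
  {
    for (k in 1:nrow (R))
    {
      if (i == k) next;
      corated <- which ((R[i,] != 0) & (R[k,] != 0));    # items rated by both users
      if (length (corated) < min_corated) next;          # too little overlap to trust
      ui <- R[i,corated] - mean (R[i,corated]);
      uk <- R[k,corated] - mean (R[k,corated]);
      denom <- sqrt (sum (ui^2) * sum (uk^2));
      if (denom > 0)                                     # avoid 0/0 when one user's co-rated ratings are constant
      {
        S[i,k] <- sum (ui * uk) / denom;
      }
    }
  }

Whether the diagonal should then be 0, 1 or left as NA depends on how the matrix is consumed downstream.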

So, to conclude:

  1. NaN does not equal 0, so do not set it to 0.
  2. 0s in your scale are better represented as null.
  3. Only calculate Pearson correlations when the number of items in common between two users is greater than 1, and preferably greater than 5-10.
  4. Only calculate the Pearson correlation for two users over the items they have both rated; do not include items in the score that have not been rated by the other user.

Hope that helps and good luck.
