What is wrong with the Pearson algorithm from “Programming Collective Intelligence”?
This function is from the book "Programming Collective Intelligence". It is supposed to calculate the Pearson correlation coefficient for p1 and p2, which should be a number between -1 and 1.
If two critics rate items very similarly, the function should return 1 or a value close to 1.
With real user data I sometimes get weird results. In the following example, the dataset critics2 should return 1, but it returns 0 instead.
Does anyone spot a mistake?
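(This is not a duplicate of "What is wrong with this python function from 'Programming Collective Intelligence'": http://stackoverflow.com/questions/1423525/what-is-wrong-with-this-python-function-from-programming-collective-intelligence)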
from __future__ import division
from math import sqrt
def sim_pearson(prefs,p1,p2):
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1
    if len(si)==0: return 0
    n=len(si)
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si])
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    if den==0: return 0
    r=num/den
    return r
critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
    },
    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
    }
}
critics2 = {
    'user1':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
    },
    'user2':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
    }
}
critics3 = {
    'user1':{
        'item1': 1,
        'item2': 3,
        'item3': 5,
    },
    'user2':{
        'item1': 5,
        'item2': 3,
        'item3': 1,
    }
}
print(sim_pearson(critics, 'user1', 'user2'))
# result: 1.0 (expected)
print(sim_pearson(critics2, 'user1', 'user2'))
# result: 0 (unexpected)
print(sim_pearson(critics3, 'user1', 'user2'))
# result: -1 (expected)
There is nothing wrong with your result. You are trying to fit a line through three points. In the second case all three points have the same coordinates, so effectively you have a single point. You cannot say whether these points correlate or anti-correlate, because an infinite number of lines can be drawn through one point (den in your code equals zero).
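To make the point above concrete, here is a minimal sketch (not the book's code) that computes the same coefficient from deviations around the mean and reports the degenerate case explicitly. The function name pearson_or_none and the choice to return None instead of 0 are assumptions made for illustration. Whether 0, None, or some other value is the right answer for constant ratings depends on how the caller uses the similarity score.

```python
from math import sqrt

def pearson_or_none(xs, ys):
    """Pearson correlation of two equal-length rating lists.

    Hypothetical helper: returns None when the coefficient is undefined,
    i.e. when at least one list has zero variance (all ratings equal).
    """
    if len(xs) == 0 or len(xs) != len(ys):
        return None
    n = float(len(xs))  # float so division behaves the same on Python 2 and 3
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    den = sqrt(var_x * var_y)
    if den == 0:
        # Constant ratings: every deviation is zero, so r would be 0/0
        return None
    return cov / den

print(pearson_or_none([5, 5, 5], [5, 5, 5]))  # the critics2 case: undefined
print(pearson_or_none([1, 3, 5], [5, 3, 1]))  # the critics3 case
```

For the critics2 ratings every deviation from the mean is zero, so both the numerator and the denominator vanish and the sketch returns None. For critics3 the deviations are exact opposites and the result is -1.0.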