Generating a test set for a recommendation engine


Question


I am working on a recommendation engine based on implicit feedback, following this tutorial: http://insightdatascience.com/blog/explicit_matrix_factorization.html#movielens

The tutorial uses ALS (alternating least squares) to compute the user and item vectors. Since my data set cannot be partitioned by time, I randomly take 'x' ratings from each user and put them into the test set. This is a reproducible example of my training user-item matrix.


  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10  col11  col12  col13
+----------------------------------------------------------------------------------+
|    1     0     0     3    10     0     0     3     0      0      1      0      0 |
|    0     0     0     5     0     0     1     8     0      0      1      0      0 |
|    0     0     0     6     7     1     0     2     0      0      1      0      0 |
+----------------------------------------------------------------------------------+

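I then create a test set using this piece of code:

    import numpy as np

    train, test = counts.copy(), np.zeros_like(counts)  # assumed setup; counts is the user-item matrix
    for user in range(counts.shape[0]):
        # move one randomly chosen non-zero item per user from train to test
        test_ratings = np.random.choice(counts[user, :].nonzero()[0], size=1, replace=False)
        train[user, test_ratings] = 0
        test[user, test_ratings] = counts[user, test_ratings]
    assert np.all((train * test) == 0)  # train and test must not overlap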

Which gives me:

  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10  col11  col12  col13
+----------------------------------------------------------------------------------+
|    0     0     0     0     0     0     0     3     0      0      0      0      0 |
|    0     0     0     0     0     0     1     0     0      0      0      0      0 |
|    0     0     0     6     0     0     0     0     0      0      0      0      0 |
+----------------------------------------------------------------------------------+

Here the rows are users and columns are items.

Now, I am wondering whether this is a correct representation of my test set. I picked one non-zero value per user and set everything else to zero, so my algorithm should rank that non-zero value as the recommended item.

Is this the correct way of going about things?

Any help would be really appreciated.

Answer

Updated:

Yes, you should create a test set from some of your original counts and see whether your system identifies those user-item pairs as a good match.
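One way to make "identifies those user-items as a good match" concrete is to check how often a held-out item lands in the user's top k recommendations. The following is a minimal sketch, assuming you already have a dense matrix of predicted scores (for example, the product of the ALS user and item factors). The names `hit_rate_at_k`, `scores`, and `k`, and the toy numbers, are hypothetical and not taken from the tutorial.

```python
import numpy as np

def hit_rate_at_k(scores, train, test, k=3):
    """Fraction of held-out (user, item) pairs that appear in the user's top k."""
    scores = np.asarray(scores, dtype=float)
    train = np.asarray(train)
    test = np.asarray(test)
    hits, total = 0, 0
    for user in range(test.shape[0]):
        held_out = np.flatnonzero(test[user])
        if held_out.size == 0:
            continue  # nothing was held out for this user
        user_scores = scores[user].copy()
        # Items the user already has in the training data are not candidates.
        user_scores[train[user] > 0] = -np.inf
        top_k = np.argsort(-user_scores)[:k]
        hits += np.intersect1d(top_k, held_out).size
        total += held_out.size
    return hits / total if total else 0.0

# Toy data (made up for illustration): 2 users, 4 items.
train = np.array([[1, 0, 0, 3],
                  [0, 5, 0, 0]])
test = np.array([[0, 0, 2, 0],
                 [0, 0, 0, 1]])
good_scores = np.array([[0.9, 0.1, 0.8, 0.7],
                        [0.2, 0.9, 0.1, 0.6]])
bad_scores = np.array([[0.9, 0.1, 0.8, 0.7],
                       [0.7, 0.9, 0.1, 0.6]])

good = hit_rate_at_k(good_scores, train, test, k=1)
bad = hit_rate_at_k(bad_scores, train, test, k=1)
```

Masking the training items before ranking matters here: otherwise the model is rewarded for "recommending" items the user has already interacted with, and the held-out item is pushed down the list.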

You should be careful with a few things:

  • only put values into your test set for items or users for which you have more data;
  • hide those test-set values from the training data;
  • train your model only on the user-item pairs for which you have data, not on the 0's. The assumption is that your 0's represent pairs for which you don't have data, not real ratings.
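The three points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tutorial's code: `split_counts`, `observed_mse`, `min_nonzero`, and `n_test` are hypothetical names, and the threshold of 3 non-zero entries is an arbitrary choice.

```python
import numpy as np

def split_counts(counts, n_test=1, min_nonzero=3, seed=0):
    """Hold out n_test non-zero entries per user, only for users with enough data."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    train = counts.copy()
    test = np.zeros_like(counts)
    for user in range(counts.shape[0]):
        nonzero_items = np.flatnonzero(counts[user])
        # Point 1: skip users with too little data (assumed threshold).
        if nonzero_items.size < max(min_nonzero, n_test + 1):
            continue
        held_out = rng.choice(nonzero_items, size=n_test, replace=False)
        # Point 2: hide the held-out values from the training data.
        train[user, held_out] = 0
        test[user, held_out] = counts[user, held_out]
    assert np.all(train * test == 0)  # no pair is in both sets
    return train, test

def observed_mse(train, predictions):
    """Point 3: squared error over observed (non-zero) training pairs only."""
    mask = np.asarray(train) > 0
    diff = np.asarray(train, dtype=float)[mask] - np.asarray(predictions, dtype=float)[mask]
    return float(np.mean(diff ** 2))

# The matrix from the question (rows are users, columns are items).
counts = np.array([
    [1, 0, 0, 3, 10, 0, 0, 3, 0, 0, 1, 0, 0],
    [0, 0, 0, 5, 0, 0, 1, 8, 0, 0, 1, 0, 0],
    [0, 0, 0, 6, 7, 1, 0, 2, 0, 0, 1, 0, 0],
])
train, test = split_counts(counts, n_test=1, min_nonzero=3, seed=0)
```

Using a seeded generator makes the split reproducible, which helps when comparing models on the same held-out pairs.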

Note: This paper, Collaborative Filtering for Implicit Feedback Datasets, should help you with these and other questions.
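For orientation, here is the core formulation from that paper as I recall it (Hu, Koren, and Volinsky), where r_ui is the raw count, x_u and y_i are the user and item vectors, and alpha and lambda are hyperparameters you would need to tune. Treat it as a summary to check against the paper rather than a definitive statement.

```latex
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}

\min_{x_*,\, y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)
```

In this formulation the counts act as confidence weights rather than as ratings to be reproduced, and pairs with a count of 0 receive only the minimal confidence of 1, which is one way to encode that a 0 means missing data rather than a real rating.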
