How can I build a CoordinateMatrix in Spark using a DataFrame?


Question

I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below as training data:

|--------------|--------------|--------------|
|    userId    |    itemId    |    rating    |
|--------------|--------------|--------------|
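
For concreteness, a DataFrame with this schema could be built as follows; the sample rows are made up, only the column names match the table above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the (userId, itemId, rating) schema above.
df = spark.createDataFrame(
    [(0, 1, 4.0), (0, 2, 3.5), (1, 2, 5.0)],
    ["userId", "itemId", "rating"],
)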

Now, I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value in the matrix will be zero. Thus, in the end, most values will be zero.

But how can I achieve this using a CoordinateMatrix? I'm saying CoordinateMatrix because I'm using Spark 2.1.1 with Python, and the documentation says a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

In other words, how can I get from this DataFrame to a CoordinateMatrix, where the rows would be users, the columns would be items, and the ratings would be the values in the matrix?

Answer

A CoordinateMatrix is just a wrapper around an RDD of MatrixEntry objects, and a MatrixEntry is just a wrapper around a (long, long, float) tuple. PySpark lets you create a CoordinateMatrix directly from an RDD of such tuples. If the userId and itemId fields are both IntegerType and rating is something like FloatType, then creating the desired matrix is straightforward:

from pyspark.mllib.linalg.distributed import CoordinateMatrix

# Each DataFrame row becomes a (userId, itemId, rating) tuple.
cmat = CoordinateMatrix(df.rdd.map(tuple))
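
To sanity-check the result, you can inspect the inferred dimensions and the underlying entries; the exact values printed depend on your data, so the ones in the comment below are only illustrative:

# The dimensions are inferred from the largest row/column indices seen.
print(cmat.numRows(), cmat.numCols())

# The underlying data is an RDD of MatrixEntry objects.
print(cmat.entries.first())  # e.g. MatrixEntry(0, 1, 4.0)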

It is only slightly more complicated if the userId and itemId fields are StringType. You would need to index those strings first and then pass the indices to the CoordinateMatrix, for example as sketched below.
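
A minimal sketch of that indexing step using StringIndexer from pyspark.ml; the userIndex and itemIndex column names are my own choice, not anything the original answer prescribes:

from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg.distributed import CoordinateMatrix

# Map the string IDs to numeric indices (StringIndexer outputs DoubleType).
indexed = StringIndexer(inputCol="userId", outputCol="userIndex").fit(df).transform(df)
indexed = StringIndexer(inputCol="itemId", outputCol="itemIndex").fit(indexed).transform(indexed)

# Cast the double indices to ints before building the (row, col, value) tuples.
entries = indexed.select("userIndex", "itemIndex", "rating") \
    .rdd.map(lambda row: (int(row[0]), int(row[1]), float(row[2])))

cmat = CoordinateMatrix(entries)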
