How can I build a CoordinateMatrix in Spark using a DataFrame?

Question

I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below, as training data:

|--------------|--------------|--------------|
|    userId    |    itemId    |    rating    |
|--------------|--------------|--------------|
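
For concreteness, a minimal sketch of how such a training DataFrame could be built (only the column names come from the question; the sample ratings are invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: (userId, itemId, rating) triples.
ratings = [(0, 0, 4.0), (0, 2, 1.5), (1, 1, 3.0), (2, 0, 5.0)]
df = spark.createDataFrame(ratings, ["userId", "itemId", "rating"])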

Now, I would like to create a sparse matrix, to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value in the matrix will be zero. Thus, in the end, most values will be zero.

But how can I achieve this, using a CoordinateMatrix? I'm saying CoordinateMatrix because I'm using Spark 2.1.1, with python, and in the documentation, I saw that a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

In other words, how can I get from this DataFrame to a CoordinateMatrix, where the rows would be users, the columns would be items and the ratings would be the values in the matrix?

Answer

A CoordinateMatrix is just a wrapper for an RDD of MatrixEntrys. A MatrixEntry is just a wrapper over a (long, long, float) tuple. Pyspark allows you to create a CoordinateMatrix from an RDD of such tuples. If the userId and itemId fields are both IntegerTypes and the rating is something like a FloatType, then creating the desired matrix is very straightforward.

from pyspark.mllib.linalg.distributed import CoordinateMatrix

# Each Row is converted to a plain (userId, itemId, rating) tuple,
# which pyspark wraps in a MatrixEntry for us.
cmat = CoordinateMatrix(df.rdd.map(tuple))
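
Once built, the matrix exposes its inferred shape and its underlying entries, so a quick sanity check might look like:

# Dimensions are inferred from the largest row/column index seen.
print(cmat.numRows(), cmat.numCols())

# The underlying data is an RDD of MatrixEntry(i, j, value) objects.
print(cmat.entries.take(3))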

It is only slightly more complicated if you have StringTypes for the userId and itemId fields. You would need to index those strings first and then pass the indices to the CoordinateMatrix.
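
A minimal sketch of that indexing step, using StringIndexer from pyspark.ml (the output column names userIndex and itemIndex are assumptions, not required names):

from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg.distributed import CoordinateMatrix

# Map the string ids to numeric indices (output column names are assumptions).
indexed = StringIndexer(inputCol="userId", outputCol="userIndex").fit(df).transform(df)
indexed = StringIndexer(inputCol="itemId", outputCol="itemIndex").fit(indexed).transform(indexed)

# StringIndexer produces doubles, so cast each row to the (long, long, float)
# shape that MatrixEntry expects.
cmat = CoordinateMatrix(
    indexed.select("userIndex", "itemIndex", "rating").rdd
           .map(lambda row: (int(row[0]), int(row[1]), float(row[2])))
)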
