Spark ALS predictAll returns empty


Problem Description

I have the following Python test code (the arguments to ALS.train are defined elsewhere):

r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)

print test.take(1)
print predictions.count()
print predictions

This works: predictions.count() returns 1, and the output is:

[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try to use an RDD I created myself with the following code, it no longer appears to work:

model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)

print validation_data.take(1)
print predictions.count()
print validation_data

Output:

[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43

As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are both of the same format. The only noticeable difference I can see is that the first example uses parallelize and produces a ParallelCollectionRDD, whereas the second example just uses a map, which produces a PythonRDD. Does predictAll only work when passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.

Answer

There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input:

  • A user is missing from the training set.
  • A product is missing from the training set.
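A practical consequence: pairs that hit either condition are silently dropped rather than scored. The pre-filtering this implies can be sketched in plain Python (a sketch only, not from the original answer; the helper name filter_known_pairs is hypothetical, and in PySpark you would express the same idea with set lookups inside a filter):

```python
def filter_known_pairs(training_ratings, test_pairs):
    """Keep only (user, product) test pairs that predictAll could score.

    training_ratings: iterable of (user, product, rating) tuples
    test_pairs: iterable of (user, product) tuples
    """
    known_users = {u for u, _, _ in training_ratings}
    known_products = {p for _, p, _ in training_ratings}
    return [(u, p) for u, p in test_pairs
            if u in known_users and p in known_products]

ratings = [(1, 5, 4.0), (2, 1, 3.0), (3, 2, 5.0), (4, 3, 1.0)]
test = [(2, 5), (2, 6), (6, 1)]  # (2, 6): unseen product; (6, 1): unseen user
print(filter_known_pairs(ratings, test))  # [(2, 5)]
```

Only the pair whose user and product both appear in training survives, which mirrors the join predictAll effectively performs against the learned factors.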

You can easily reproduce this behavior and check that it does not depend on how the RDD was created. First, let's use example data to build a model:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

def parse(s):
    # Parse a "user,product,rating" line into a Rating
    x, y, z = s.split(",")
    return Rating(int(x), int(y), float(z))

ratings = (sc.textFile("data/mllib/als/test.data")
  .map(parse)
  .union(sc.parallelize([Rating(1, 5, 4.0)])))

model = ALS.train(ratings, 10, 10)

Next, let's see which products and users are present in the training data:

set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}

set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}

Now let's create test data and check the predictions:

valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423

model.predictAll(valid_test).count()
## 3

So far so good. Next, let's map it using the same logic as in your code:

valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43

model.predictAll(valid_test_).count()
## 3

Still fine. Next, let's create invalid data and repeat the experiment:

invalid_test = sc.parallelize([
  (2, 6), # No product in the training data
  (6, 1)  # No user in the training data
])
invalid_test 
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423

model.predictAll(invalid_test).count()
## 0 

invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0

As expected, there are no predictions for the invalid input.
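When predictAll returns fewer rows than you fed it, the dropped pairs can be identified with a simple set difference (plain-Python sketch; the helper name missing_pairs is hypothetical, and on real RDDs the same idea can be expressed by collecting the keys of both sides):

```python
def missing_pairs(requested, predicted):
    """Return the (user, product) pairs that received no prediction.

    requested: iterable of (user, product) tuples
    predicted: iterable of (user, product, score) tuples, predictAll-style
    """
    predicted_keys = {(u, p) for u, p, _ in predicted}
    return sorted(set(requested) - predicted_keys)

requested = [(2, 5), (1, 4), (2, 6), (6, 1)]
predicted = [(2, 5, 4.0), (1, 4, 1.02)]  # (2, 6) and (6, 1) were dropped
print(missing_pairs(requested, predicted))  # [(2, 6), (6, 1)]
```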

Finally, you can confirm this really is the case by using an ML model whose training / prediction is completely independent of the Python code:

from pyspark.ml.recommendation import ALS as MLALS

model_ml = MLALS(rank=10, maxIter=10).fit(
    ratings.toDF(["user", "item", "rating"])
)
# RDDs do not support the + operator; use union to combine them
model_ml.transform(valid_test.union(invalid_test).toDF(["user", "item"])).show()

## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## |   6|   1|       NaN|
## |   1|   4| 1.0184212|
## |   2|   5| 4.0041084|
## |   3|   5|0.40498763|
## |   2|   6|       NaN|
## +----+----+----------+

As you can see, no corresponding user / item in the training data means no prediction.
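A side note on the NaN rows above: in Spark 2.2 and later (newer than the version used here, so treat this as an assumption about your environment), the ML ALS accepts coldStartStrategy="drop" to remove such rows automatically. The same filtering can be sketched in plain Python (the helper name drop_nan_predictions is hypothetical):

```python
import math

def drop_nan_predictions(rows):
    """rows: (user, item, prediction) tuples; drop cold-start NaN rows."""
    return [r for r in rows if not math.isnan(r[2])]

rows = [(6, 1, float("nan")), (1, 4, 1.0184212),
        (2, 6, float("nan")), (2, 5, 4.0041084)]
print(drop_nan_predictions(rows))  # [(1, 4, 1.0184212), (2, 5, 4.0041084)]
```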
