Spark ALS predictAll returns empty


Question


I have the following Python test code (the arguments to ALS.train are defined elsewhere):

r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)

print test.take(1)
print predictions.count()
print predictions

This works: the predictions variable has a count of 1, and the output is:

[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try to use an RDD I created myself with the following code, it no longer appears to work:

model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)

print validation_data.take(1)
print predictions.count()
print validation_data

Which outputs:

[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43

As you can see, predictAll comes back empty when passed the mapped RDD. The values going in are in the same format in both cases. The only noticeable difference I can see is that the first example uses parallelize and produces a ParallelCollectionRDD, whereas the second just uses a map, which produces a PythonRDD. Does predictAll only work if passed a certain type of RDD? If so, is it possible to convert between RDD types? I'm not sure how to get this working.

Answer

There are two basic conditions under which MatrixFactorizationModel.predictAll may return an RDD with fewer items than the input:

  • the user is missing from the training set.
  • the product is missing from the training set.
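A practical consequence of those two conditions: you can pre-filter a test RDD so it only contains users and products seen during training. Here is a minimal sketch of that membership check in plain Python; the `training` and `test` pair lists are made-up stand-ins for the RDD contents, and with real RDDs you would broadcast the two sets and use `filter` instead:

```python
# Stand-ins for the (user, product) pairs held by the ratings and test RDDs.
training = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
test = [(2, 2), (1, 1), (3, 5), (6, 1)]  # (3, 5): unseen product; (6, 1): unseen user

# Collect the users and products present in the training data.
known_users = {u for u, _ in training}
known_products = {p for _, p in training}

# predictAll can only score pairs whose user AND product were both seen.
scorable = [(u, p) for u, p in test
            if u in known_users and p in known_products]
print(scorable)  # [(2, 2), (1, 1)]
```

With RDDs the same idea would collect the two sets, broadcast them, and apply `test.filter(...)` before calling predictAll.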

You can easily reproduce this behavior and check that it does not depend on how the RDD was created. First, let's use example data to build a model:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

def parse(s):
    x, y, z = s.split(",")
    return Rating(int(x), int(y), float(z))

ratings = (sc.textFile("data/mllib/als/test.data")
  .map(parse)
  .union(sc.parallelize([Rating(1, 5, 4.0)])))

model = ALS.train(ratings, 10, 10)

Next, let's see which products and users are present in the training data:

set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}

set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}

Now, let's create test data and check predictions:

valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423

model.predictAll(valid_test).count()
## 3

So far so good. Next, let's map it using the same logic as in your code:

valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43

model.predictAll(valid_test_).count()
## 3

Still fine. Next, let's create invalid data and repeat the experiment:

invalid_test = sc.parallelize([
  (2, 6), # No product in the training data
  (6, 1)  # No user in the training data
])
invalid_test 
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423

model.predictAll(invalid_test).count()
## 0 

invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0

As expected, there are no predictions for the invalid input.
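If you want to see exactly which input pairs were silently dropped, you can compare the keys going in with the keys coming back. Here is a sketch of that diagnosis in plain Python; the two pair lists are stand-ins for `collect()`ed RDD contents (with real RDDs, `subtractByKey` on keyed pairs expresses the same difference):

```python
# Pairs fed to predictAll, and the (user, product) keys it returned.
requested = [(2, 5), (1, 4), (2, 6), (6, 1)]
predicted = [(2, 5), (1, 4)]  # stand-in for the keys of the prediction RDD

# Anything requested but not predicted involves an unseen user or product.
dropped = sorted(set(requested) - set(predicted))
print(dropped)  # [(2, 6), (6, 1)]
```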

Finally, you can confirm this is really the case by using the ML model, whose training and prediction are completely independent of the Python code:

from pyspark.ml.recommendation import ALS as MLALS

model_ml = MLALS(rank=10, maxIter=10).fit(
    ratings.toDF(["user", "item", "rating"])
)
model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()

## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## |   6|   1|       NaN|
## |   1|   4| 1.0184212|
## |   2|   5| 4.0041084|
## |   3|   5|0.40498763|
## |   2|   6|       NaN|
## +----+----+----------+

As you can see, no corresponding user/item in the training data means no prediction.
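The NaN rows in the transform output above are how the ML API signals the same missing-user/missing-product cases; newer Spark versions (2.2+) can drop them at the source via the `coldStartStrategy="drop"` parameter on the ML ALS estimator. The equivalent post-hoc filtering, sketched in plain Python with tuples standing in for the rows of the `transform` output (mirroring `na.drop()` on the DataFrame):

```python
import math

# Stand-in rows for the (user, item, prediction) output of model_ml.transform.
rows = [(6, 1, float("nan")), (1, 4, 1.0184212), (2, 5, 4.0041084),
        (3, 5, 0.40498763), (2, 6, float("nan"))]

# Keep only rows with a real prediction, i.e. drop the cold-start NaNs.
scored = [r for r in rows if not math.isnan(r[2])]
print(len(scored))  # 3
```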

