Spark ALS predictAll 返回空 [英] Spark ALS predictAll returns empty
问题描述
我有以下 Python 测试代码(ALS.train
的参数在别处定义):
I have the following Python test code (the arguments to ALS.train
are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
哪个有效,因为它对预测变量和输出的计数为 1:
Which works, because it has a count of 1 against the predictions variable and outputs:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
但是,当我尝试使用 RDD
时,我使用以下代码创建了自己,但它似乎不再起作用:
However, when I try and use an RDD
I created myself using the following code, it doesn't appear to work anymore:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
输出:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
如您所见,当传递映射的 RDD
时,predictAll
返回空.输入的值都是相同的格式.我能看到的唯一明显区别是第一个示例使用并行化并生成 ParallelCollectionRDD
而第二个示例仅使用生成 PythonRDD
的映射.predictAll
仅在传递某种类型的 RDD
时才有效吗?如果是这样,是否可以在 RDD
类型之间进行转换?我不知道如何让这个工作.
As you can see, predictAll
comes back empty when passed the mapped RDD
. The values going in are both of the same format. The only noticeable difference that I can see is that the first example uses parallelize and produces a ParallelCollectionRDD
whereas the second example just uses a map which produces a PythonRDD
. Does predictAll
only work if passed a certain type of RDD
? If so, is it possible to convert between RDD
types? I'm not sure how to get this working.
推荐答案
MatrixFactorizationMode.predictAll
在两种基本条件下可能会返回项数少于输入项的 RDD:
There are two basic conditions under which MatrixFactorizationMode.predictAll
may return a RDD with lower number of items than the input:
- 训练集中缺少用户.
- 训练集中缺少产品.
您可以轻松重现此行为并检查它是否不依赖于 RDD 的创建方式.首先让我们使用示例数据来构建模型:
You can easily reproduce this behavior and check that it is is not dependent on the way how RDD has been created. First lets use example data to build a model:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
def parse(s):
x, y, z = s.split(",")
return Rating(int(x), int(y), float(z))
ratings = (sc.textFile("data/mllib/als/test.data")
.map(parse)
.union(sc.parallelize([Rating(1, 5, 4.0)])))
model = ALS.train(ratings, 10, 10)
接下来让我们看看训练数据中存在哪些产品和用户:
Next lets see which products and users are present in the training data:
set(ratings.map(lambda r: r.product).collect())
## {1, 2, 3, 4, 5}
set(ratings.map(lambda r: r.user).collect())
## {1, 2, 3, 4}
现在让我们创建测试数据并检查预测:
Now lets create test data and check predictions:
valid_test = sc.parallelize([(2, 5), (1, 4), (3, 5)])
valid_test
## ParallelCollectionRDD[434] at parallelize at PythonRDD.scala:423
model.predictAll(valid_test).count()
## 3
到目前为止一切顺利.接下来让我们使用与您的代码相同的逻辑来映射它:
So far so good. Next lets map it using the same logic as in your code:
valid_test_ = valid_test.map(lambda xs: tuple(int(x) for x in xs))
valid_test_
## PythonRDD[497] at RDD at PythonRDD.scala:43
model.predictAll(valid_test_).count()
## 3
还好.接下来让我们创建无效数据并重复实验:
Still fine. Next lets create invalid data and repeat experiment:
invalid_test = sc.parallelize([
(2, 6), # No product in the training data
(6, 1) # No user in the training data
])
invalid_test
## ParallelCollectionRDD[500] at parallelize at PythonRDD.scala:423
model.predictAll(invalid_test).count()
## 0
invalid_test_ = invalid_test.map(lambda xs: tuple(int(x) for x in xs))
model.predictAll(invalid_test_).count()
## 0
正如预期的那样,没有对无效输入的预测.
As expected there are no predictions for invalid input.
最后,您可以通过使用完全独立于 Python 代码的训练/预测的 ML 模型来确认情况确实如此:
Finally you can confirm this is really the case by using ML model which is completely independent in training / prediction from Python code:
from pyspark.ml.recommendation import ALS as MLALS
model_ml = MLALS(rank=10, maxIter=10).fit(
ratings.toDF(["user", "item", "rating"])
)
model_ml.transform((valid_test + invalid_test).toDF(["user", "item"])).show()
## +----+----+----------+
## |user|item|prediction|
## +----+----+----------+
## | 6| 1| NaN|
## | 1| 4| 1.0184212|
## | 2| 5| 4.0041084|
## | 3| 5|0.40498763|
## | 2| 6| NaN|
## +----+----+----------+
如您所见,在训练数据中没有相应的用户/项目意味着没有预测.
As you can see no corresponding user / item in the training data means no prediction.
这篇关于Spark ALS predictAll 返回空的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!