Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator


Problem description

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out.

I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.sql.functions import rand


conf = SparkConf() \
  .setAppName("MovieLensALS") \
  .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

dfRatings = sqlContext.createDataFrame([(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
                                 ["user", "item", "rating"])
dfRatingsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])

alsExplicit = ALS()
defaultModel = alsExplicit.fit(dfRatings)

paramMapExplicit = ParamGridBuilder() \
                    .addGrid(alsExplicit.rank, [8, 12]) \
                    .addGrid(alsExplicit.maxIter, [10, 15]) \
                    .addGrid(alsExplicit.regParam, [1.0, 10.0]) \
                    .build()

evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")

cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluatorR)
cvModelExplicit = cvExplicit.fit(dfRatings)

predsExplicit = cvModelExplicit.bestModel.transform(dfRatingsTest)
predsExplicit.show()

When I try to do this for implicit data (let's say counts of views rather than ratings), I get an error that I can't quite figure out. Here's the code (very similar to the above):

dfCounts = sqlContext.createDataFrame([(0,0,0), (0,1,12), (0,2,3), (1,0,5), (1,1,9), (1,2,0), (2,0,0), (2,1,11), (2,2,25)],
                                 ["user", "item", "rating"])
dfCountsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])

alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCounts)

paramMapImplicit = ParamGridBuilder() \
                    .addGrid(alsImplicit.rank, [8, 12]) \
                    .addGrid(alsImplicit.maxIter, [10, 15]) \
                    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
                    .addGrid(alsImplicit.alpha, [2.0,3.0]) \
                    .build()

evaluatorB = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="rating")
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")

cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR)
cvModel = cv.fit(dfCounts)

predsImplicit = cvModel.bestModel.transform(dfCountsTest)
predsImplicit.show()

I tried doing this with an RMSE evaluator and I get an error. As I understand it, I should also be able to use the AUC metric with the binary classification evaluator, because the predictions of implicit matrix factorization are a confidence matrix c_ui for the binary preference matrix p_ui, per this paper, which the documentation for pyspark ALS cites.
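
For reference, here is a rough sketch of the setup that paper describes (illustrative only; dfPrefs, p_ui, c_ui and alphaExample are just names made up for this sketch, run against the dfCounts DataFrame above): implicit ALS fits a binary preference matrix p_ui weighted by confidences c_ui = 1 + alpha * r_ui, so its predictions approximate p_ui rather than the raw counts.

from pyspark.sql.functions import col, lit

alphaExample = 1.0  # plays the role of the ALS "alpha" parameter

dfPrefs = dfCounts \
    .withColumn("p_ui", (col("rating") > 0).cast("double")) \
    .withColumn("c_ui", lit(1.0) + lit(alphaExample) * col("rating"))

dfPrefs.show()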

Using either evaluator gives me an error and I can't find any fruitful discussion about cross-validating implicit ALS models online. I'm looking through the CrossValidator source code to try to figure out what's wrong, but am having trouble. One of my thoughts is that after the process converts the implicit data matrix r_ui to the binary matrix p_ui and confidence matrix c_ui, I'm not sure what it's comparing the predicted c_ui matrix against during the evaluation stage.
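
For reference, the relevant part of CrossValidator._fit in Spark 1.6 looks roughly like this (a simplified paraphrase of pyspark/ml/tuning.py, not the verbatim source): each fold is carved out with a random column, the estimator is fit on the remainder, and the evaluator simply scores model.transform(validation), comparing its prediction column with the evaluator's labelCol.

df = dataset.select("*", rand(seed).alias(randCol))
h = 1.0 / nFolds
metrics = [0.0] * numModels
for i in range(nFolds):
    condition = (df[randCol] >= i * h) & (df[randCol] < (i + 1) * h)
    validation = df.filter(condition)
    train = df.filter(~condition)
    for j in range(numModels):
        model = est.fit(train, epm[j])
        metrics[j] += eva.evaluate(model.transform(validation, epm[j]))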

Here is the error:

Traceback (most recent call last):

  File "<ipython-input-16-6c43b997005e>", line 1, in <module>
    cvModel = cv.fit(dfCounts)

  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 69, in fit
    return self._fit(dataset)

  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\tuning.py", line 239, in _fit
    model = est.fit(train, epm[j])

  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 67, in fit
    return self.copy(params)._fit(dataset)

  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 133, in _fit
    java_model = self._fit_java(dataset)

  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 130, in _fit_java
    return self._java_obj.fit(dataset._jdf)

  File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)

  File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\sql\utils.py", line 45, in deco
    return f(*a, **kw)

  File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)

etc.......

UPDATE

I tried scaling the input so it's in the range of 0 to 1 and using an RMSE evaluator. It seems to work well until I try to insert it into the CrossValidator.

The following code works. I get predictions and an RMSE value from my evaluator.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator


conf = SparkConf() \
  .setAppName("ALSPractice") \
  .set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

# Users 0, 1, 2, 3 - Items 0, 1, 2, 3, 4 - Ratings 0.0-5.0
dfCounts2 = sqlContext.createDataFrame([(0,0,5.0), (0,1,5.0),            (0,3,0.0), (0,4,0.0), 
                                        (1,0,5.0),            (1,2,4.0), (1,3,0.0), (1,4,0.0),
                                        (2,0,0.0),            (2,2,0.0), (2,3,5.0), (2,4,5.0),
                                        (3,0,0.0), (3,1,0.0),            (3,3,4.0)            ],
                                       ["user", "item", "rating"])

dfCountsTest2 = sqlContext.createDataFrame([(0,0), (0,1), (0,2), (0,3), (0,4),
                                            (1,0), (1,1), (1,2), (1,3), (1,4),
                                            (2,0), (2,1), (2,2), (2,3), (2,4),
                                            (3,0), (3,1), (3,2), (3,3), (3,4)], ["user", "item"])

# Normalize rating data to [0,1] range based on max rating
colmax = dfCounts2.select(F.max('rating')).collect()[0][0]
normalize = udf(lambda x: x/colmax, FloatType())
dfCountsNorm = dfCounts2.withColumn('ratingNorm', normalize(col('rating')))

alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCountsNorm)
preds = defaultModelImplicit.transform(dfCountsTest2)

evaluatorR2 = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")
evaluatorR2.evaluate(defaultModelImplicit.transform(dfCountsNorm))

preds = defaultModelImplicit.transform(dfCountsTest2)

What I don't understand is why the following doesn't work. I'm using the same estimator, the same evaluator and fitting the same data. Why would these work above but not within the CrossValidator:

paramMapImplicit = ParamGridBuilder() \
                    .addGrid(alsImplicit.rank, [8, 12]) \
                    .addGrid(alsImplicit.maxIter, [10, 15]) \
                    .addGrid(alsImplicit.regParam, [1.0, 10.0]) \
                    .addGrid(alsImplicit.alpha, [2.0,3.0]) \
                    .build()

cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR2)
cvModel = cv.fit(dfCountsNorm)

Answer

Ignoring technical issues, strictly speaking neither method is correct given the input generated by ALS with implicit feedback.

  • You cannot use RegressionEvaluator because, as you already know, the prediction can be interpreted as a confidence value and is represented as a floating-point number in the range [0, 1], while the label column is just an unbounded integer. These values are clearly not comparable (see the quick check after this list).
  • You cannot use BinaryClassificationEvaluator because even if the prediction can be interpreted as a probability, the label doesn't represent a binary decision. Moreover, the prediction column has an invalid type and cannot be used directly with BinaryClassificationEvaluator.
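
A quick check makes the mismatch concrete (my own snippet, reusing defaultModelImplicit and dfCounts from the question):

# "prediction" holds confidence-like floats while "rating" holds raw counts,
# so comparing them with RMSE (or feeding the raw double to a binary metric)
# is not meaningful
defaultModelImplicit.transform(dfCounts) \
    .select("user", "item", "rating", "prediction") \
    .show()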

You can try to convert one of the columns so the input fits the requirements, but this is not really a justified approach from a theoretical perspective and it introduces additional parameters which are hard to tune:

  • Map the label column to the [0, 1] range and use RMSE.

  • Convert the label column to a binary indicator with a fixed threshold and extend ALS / ALSModel to return the expected column type. Assuming the threshold value is 1, it could be something like this:

from pyspark.ml.recommendation import *
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import udf, col
from pyspark.mllib.linalg import DenseVector, VectorUDT

# ALS subclass that insists on implicit feedback and wraps the fitted model
class BinaryALS(ALS):
    def fit(self, df):
        assert self.getImplicitPrefs()
        model = super(BinaryALS, self).fit(df)
        return ALSBinaryModel(model._java_obj)

# ALSModel wrapper that repackages the scalar prediction as a two-element
# raw-score vector, the column type BinaryClassificationEvaluator expects
class ALSBinaryModel(ALSModel):
    def transform(self, df):
        transformed = super(ALSBinaryModel, self).transform(df)
        as_vector = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
        return transformed.withColumn(
            "rawPrediction", as_vector(col("prediction")))

# Add binary label column
with_binary = dfCounts.withColumn(
    "label_binary", (col("rating") > 0).cast("double"))

als_binary_model = BinaryALS(implicitPrefs=True).fit(with_binary)

evaluatorB = BinaryClassificationEvaluator(
    metricName="areaUnderROC", labelCol="label_binary")

evaluatorB.evaluate(als_binary_model.transform(with_binary))
## 1.0

Generally speaking, material about evaluating recommender systems with implicit feedback is kind of missing from textbooks; I suggest you read eliasah's answer about evaluating these kinds of recommenders.
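
As one possible direction (a sketch of my own, not taken from that answer): treat the problem as ranking and score recommended item lists against the items each user actually interacted with, e.g. with pyspark.mllib.evaluation.RankingMetrics:

from pyspark.mllib.evaluation import RankingMetrics

# toy, hand-built (recommended items, relevant items) pairs, one per user
predictionAndLabels = sc.parallelize([
    ([1, 2, 0], [1, 2]),   # user 0
    ([2, 0, 1], [0, 2]),   # user 1
])

rankingMetrics = RankingMetrics(predictionAndLabels)
print(rankingMetrics.meanAveragePrecision)
print(rankingMetrics.precisionAt(2))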
