The prediction time of Spark matrix factorization


Problem description



I have a simple Python app. It takes a ratings.csv with user_id, product_id, rating, containing 4 M records. I train a model with Spark ALS, save it, and then load it back as a MatrixFactorizationModel.

My problem is that predict takes more than one second to predict the rating between a user and a product. My server has 32 GB of RAM and 8 cores. Any suggestion on how I can bring the prediction time below 100 milliseconds? Also, what is the relationship between the number of records in the dataset and the prediction time?

Here is what I am doing:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

spark_config = SparkConf().setAll([('spark.executor.memory', '32g'), ('spark.cores.max', '8')])
als_recommender.sc = SparkContext(conf=spark_config)
# training_data is an array of (user, product, rating) tuples, 4 M records
training_data = als_recommender.sc.parallelize(training_data)
als_recommender.model = ALS.trainImplicit(training_data, 10, 10, nonnegative=True)
als_recommender.model.save(als_recommender.sc, "....Ameer/als_model")
als_recommender_model = MatrixFactorizationModel.load(als_recommender.sc, "....Ameer/als_model")
als_recommender_model.predict(1, 2913)

Solution

Basically, you do not want to have to load the full model every time you need to answer a query.

Depending on the model update frequency and on the number of prediction queries, I would either:

  • keep the model in memory and answer queries from there. To get answers under 100 ms you will need to measure each step. Livy could be a good fit, but I am not sure about its overhead.
  • output the top X predictions for each user and store them in a DB. Redis is a good candidate since it is fast and its values can be lists.
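A minimal sketch of the first option (with assumed, toy factor values): an ALS prediction is just the dot product of the user's and product's rank-k factor vectors, so once the factors are pulled out of the model into plain Python dicts, each lookup avoids the per-request Spark round trip entirely.

```python
# Sketch: serve ALS predictions from in-memory factor dicts instead of
# calling model.predict() through Spark on every request.
# In the real app the dicts would be built once after training, e.g.:
#   user_factors = dict(model.userFeatures().collect())
#   product_factors = dict(model.productFeatures().collect())
# Toy stand-in factors (rank 2) for illustration only:
user_factors = {1: [0.5, 1.0]}
product_factors = {2913: [2.0, 0.5]}

def predict_fast(user_id, product_id):
    # An ALS score is the dot product of the two factor vectors.
    u = user_factors[user_id]
    p = product_factors[product_id]
    return sum(a * b for a, b in zip(u, p))

print(predict_fast(1, 2913))  # 0.5*2.0 + 1.0*0.5 = 1.5
```

The same loop over all products (or a top-X heap per user) is how you would precompute the lists to push into Redis for the second option.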
