The prediction time of Spark matrix factorization
Problem description
I have a simple Python app. It takes ratings.csv, which holds user_id, product_id, rating and contains 4M records. I train with Spark ALS and save the model, then I load it into a MatrixFactorizationModel.
My problem is with the predict method, which takes more than one second to predict the rating between a user and a product. My server has 32 GB of RAM and 8 cores. Any suggestion on how I can get the prediction time under 100 milliseconds? Also, what is the relationship between the number of records in the dataset and the prediction time?
Here is what I am doing:
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

spark_config = SparkConf().setAll([('spark.executor.memory', '32g'), ('spark.cores.max', '8')])
als_recommender.sc = SparkContext(conf=spark_config)
# training_data is an array of tuples of 4M records: (user_id, product_id, rating)
training_data = als_recommender.sc.parallelize(training_data)
als_recommender.model = ALS.trainImplicit(training_data, 10, 10, nonnegative=True)  # rank=10, iterations=10
als_recommender.model.save(als_recommender.sc, "....Ameer/als_model")
als_recommender_model = MatrixFactorizationModel.load(als_recommender.sc, "....Ameer/als_model")
als_recommender_model.predict(1, 2913)
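For reference, it helps to time the call itself to confirm that the latency sits in predict; a minimal sketch, assuming the model above is already loaded:

import time

start = time.time()
rating = als_recommender_model.predict(1, 2913)  # each call runs a small Spark job over the factor RDDs
print("prediction: %.3f, latency: %.1f ms" % (rating, (time.time() - start) * 1000.0))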
Recommended answer
Basically, you do not want to have to load the full model every time you need to answer a query.
Depending on the model update frequency and on the number of prediction queries, I would either:
- keep the model in memory and answer queries from there. For answers < 100 ms you will need to measure each step; Livy can be a good fit, but I am not sure about its overhead. (See the in-memory sketch after this list.)
- output the top X predictions for each user and store them in a DB. Redis is a good candidate since it is fast and its values can be lists. (See the Redis sketch after this list.)
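To illustrate the first option: the per-call cost of predict comes from touching the factor RDDs, so you can collect the latent factors to the driver once and score with a plain dot product. This is a minimal sketch, not the answerer's code; user_factors, product_factors, and predict_fast are names I introduced, and it assumes the factor matrices fit in driver memory:

import numpy as np

# One-time setup: pull the learned ALS factors out of the RDDs into plain dicts.
user_factors = dict(als_recommender_model.userFeatures().collect())
product_factors = dict(als_recommender_model.productFeatures().collect())

def predict_fast(user_id, product_id):
    # Same score ALS produces: dot product of the two latent factor vectors.
    return float(np.dot(user_factors[user_id], product_factors[product_id]))

predict_fast(1, 2913)  # in-memory lookup plus a dot product, far below 100 ms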
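For the second option, a sketch of precomputing top-X recommendations and pushing them into Redis lists; the key pattern recs:<user_id>, X = 10, and the local Redis connection are assumptions for illustration:

import redis

r = redis.StrictRedis(host='localhost', port=6379)  # assumed local Redis instance

# recommendProductsForUsers(10) returns an RDD of (user_id, [Rating(user, product, score), ...])
for user_id, ratings in als_recommender_model.recommendProductsForUsers(10).collect():
    key = 'recs:%d' % user_id  # hypothetical key pattern
    r.delete(key)
    r.rpush(key, *[rating.product for rating in ratings])

# Answering a query is then a single Redis list read:
r.lrange('recs:1', 0, -1)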