How to use Spark MLlib/Pipelines to build one model per user


Question

I want to train a different model for each user in my dataset. Is there built-in support for that in Spark MLlib/Pipelines?

If not, what is the easiest/cleanest way to train multiple separate models, one per user?

Recommended answer

Unfortunately, Spark ML has no built-in notion of "one model per user", but you can implement custom logic for it. I see two possible ways to solve this task. The first is to follow the algorithm below (the details are only an example; your steps will differ, but the logic will be similar):

  • Obtain the training data for a specific user (e.g. read a CSV file from HDFS, S3, etc.).
  • Train a model on the Dataset that holds this user's data. For example, suppose the dataset has two columns, a specific criterion X and the user's productivity Y, where Y varies between user groups. You could then train a model with, for instance, LinearRegression to predict whether the user can finish the work in time.
  • Save the trained model to disk under a name that depends on the user's id, group, etc.

The second approach is to train a single model that is applicable to every user. You must choose the algorithm's options so that it does not depend on the user group; in other words, generalize the training to all user groups. In this case there is no "single model -> single user" separation. If the second variant is harder to implement for your dataset, follow the first approach.

