How to use Spark MLlib/Pipelines to build one model per user

Question

I want to train a different model for each user in my dataset. Is there built-in support for this in Spark MLlib/Pipelines?

If not, what is the easiest and cleanest way to train multiple separate models, one per user?

Recommended answer

Unfortunately, Spark ML has no built-in notion of "one model per user", but you can write custom logic for it. I see two possible ways to solve this task. The first is to follow the steps below (the details are only an example; your steps will differ, but the logic will be similar):

  • Obtain the training data for a specific user (e.g. read a CSV file from HDFS, S3, etc.).
  • Train a model on the part of the Dataset that belongs to that user. For example, suppose the dataset has two columns, a specific criterion X and the user's productivity Y, where the latter varies between user groups. You could train a model with, for instance, LinearRegression to predict whether the user can finish the work in time.
  • Save the trained model to disk under a name derived from the user's id, group, etc.

The second approach is to train a single model that applies to every user. You must choose the algorithm's options so that it does not depend on the user group; in other words, generalize the training to all user groups. In this case there is no "single model -> single user" separation. If the second variant is harder to implement on your dataset, follow the first approach.
