Initializing logistic regression coefficients when using the Spark dataset-based ML APIs?


Problem description



By default, logistic regression training initializes the coefficients to be all-zero. However, I would like to initialize the coefficients myself. This would be useful, for example, if a previous training run crashed after several iterations -- I could simply restart training with the last known set of coefficients.

Is this possible with any of the dataset/dataframe-based APIs, preferably Scala?

Looking at the Spark source code, it seems that there is a method setInitialModel to initialize the model and its coefficients, but it's unfortunately marked as private.

The RDD-based API seems to allow initializing coefficients: one of the overloads of LogisticRegressionWithSGD.run(...) accepts an initialWeights vector. However, I would like to use the dataset-based API instead of the RDD-based API because (1) the former supports elastic net regularization (I couldn't figure out how to do elastic net with the RDD-based logistic regression) and (2) because the RDD-based API is in maintenance mode.
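For reference, using the RDD-based overload mentioned above might look like the following sketch. It assumes an existing `training: RDD[LabeledPoint]` and a previously saved weight vector; the vector literal is a hypothetical stand-in for the last known coefficients, and this API is deprecated in recent Spark versions.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Resume training from the last known coefficients using the RDD-based API.
// `training` is assumed to be an RDD[LabeledPoint] already in scope.
val lr = new LogisticRegressionWithSGD()
val initialWeights = Vectors.dense(0.1, -0.3, 0.7) // e.g. coefficients from a crashed run
val model = lr.run(training, initialWeights)
```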

I could always try using reflection to call that private setInitialModel method, but I would like to avoid this if possible (and maybe that wouldn't even work... I also can't tell if setInitialModel is marked private for a good reason).
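The reflection route described above could be sketched roughly as follows. The method name and parameter type are taken from the Spark source, but since `setInitialModel` is not public API, this may break without warning across Spark versions.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Sketch: call the non-public setInitialModel via Java reflection.
// Not public API -- verify against the source of your exact Spark version.
def withInitialModel(lr: LogisticRegression,
                     initial: LogisticRegressionModel): LogisticRegression = {
  val m = classOf[LogisticRegression]
    .getDeclaredMethod("setInitialModel", classOf[LogisticRegressionModel])
  m.setAccessible(true)
  m.invoke(lr, initial)
  lr
}
```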

Solution

Feel free to override the method. Yes, you will need to copy that class into your own work area. That's fine: do not fear.

When you build your project, either via Maven or sbt, your local copy of the class will "win" and shadow the one in MLlib. Fortunately, the other classes in that same package will not be affected.

I have used this approach many times to override Spark classes; the impact on build times should be small as well.
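Concretely, the approach above might look like this sketch: copy `LogisticRegression.scala` from the Spark release matching your dependency into the same package path in your own source tree, then relax the one access modifier. The file path and the method body shown are assumptions; both must be taken from the source of your exact Spark version.

```scala
// src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
// (copied verbatim from the Spark source matching your dependency version)
package org.apache.spark.ml.classification

// Inside the copied LogisticRegression class, keep everything unchanged
// except the access modifier on setInitialModel:
//
// Before (Spark source):
//   private[spark] def setInitialModel(model: LogisticRegressionModel): this.type = ...
// After (your local copy, same body, now public):
def setInitialModel(model: LogisticRegressionModel): this.type = {
  this.optInitialModel = Some(model) // body as in the Spark source
  this
}
```

With the local copy on the classpath, the estimator can then be used as usual, e.g. `new LogisticRegression().setElasticNetParam(0.5).setInitialModel(previousModel).fit(trainingDF)`.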
