Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM


Question


I used ML Pipeline to run logistic regression models, but for some reason I got worse results than R. I have done some research, and the only post I found related to this issue is this one: http://datascience.stackexchange.com/questions/5710/why-does-logistic-regression-in-spark-and-r-return-different-models-for-the-same. It seems that Spark Logistic Regression returns models that minimize the loss function, while the R glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R predicts 95.55% of the cases correctly. I was wondering if I did something wrong in the setup and whether there is a way to improve the prediction. Below are my Spark code and R code.

Spark code

partial model_input  
label,AGE,GENDER,Q1,Q2,Q3,Q4,Q5,DET_AGE_SQ  
1.0,39,0,0,1,0,0,1,31.55709342560551  
1.0,54,0,0,0,0,0,0,83.38062283737028  
0.0,51,0,1,1,1,0,0,35.61591695501733



import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

def trainModel(df: DataFrame): PipelineModel = {
  // Very high iteration cap and tight tolerance to force convergence
  val lr = new LogisticRegression().setMaxIter(100000).setTol(0.0000000000000001)
  val pipeline = new Pipeline().setStages(Array(lr))
  pipeline.fit(df)
}

// Nominal metadata for the binary label column
val meta = NominalAttribute.defaultAttr.withName("label").withValues(Array("a", "b")).toMetadata

// Assemble the raw columns into a single "features" vector
// (column names Q1..Q5 match the sample header and the R formula)
val assembler = new VectorAssembler().
  setInputCols(Array("AGE", "GENDER", "DET_AGE_SQ",
    "Q1", "Q2", "Q3", "Q4", "Q5")).
  setOutputCol("features")

val model = trainModel(model_input)
val pred = model.transform(model_input)
pred.filter("label != prediction").count
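Note that the snippet above defines meta and assembler but never applies them, so model_input must have been prepared beforehand. A minimal sketch of that step, assuming a raw input DataFrame named raw_df:

import org.apache.spark.sql.functions.col

// Hypothetical preparation step: build the "features" vector and attach
// the nominal metadata to the label column ("raw_df" is an assumed name)
val model_input = assembler.transform(raw_df).
  withColumn("label", col("label").as("label", meta))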

R code

lr <- model_input %>% glm(data=., formula=label~ AGE+GENDER+Q1+Q2+Q3+Q4+Q5+DET_AGE_SQ,
          family=binomial)
pred <- data.frame(y=model_input$label,p=fitted(lr))
table(pred $y, pred $p>0.5)
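For reference, a rough Spark-side equivalent of this confusion table, assuming the pred DataFrame from the Scala snippet above:

// Counts per (label, prediction) pair -- analogous to R's table(y, p > 0.5)
pred.groupBy("label", "prediction").count().show()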

Feel free to let me know if you need any other information. Thank you!

Edit 9/18/2015: I tried increasing the maximum number of iterations and dramatically decreasing the tolerance. Unfortunately, it didn't improve the prediction. It seems the model converged to a local minimum instead of the global minimum.

Solution

It seems that Spark Logistic Regression returns models that minimize loss function while R glm function uses maximum likelihood.

Minimization of a loss function is pretty much the definition of a linear model, and glm and ml.classification.LogisticRegression are no different here. The fundamental difference between the two is how that minimization is achieved.
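Concretely, for binary labels $y_i \in \{0, 1\}$, both tools (ignoring regularization) target the same objective, the negative log-likelihood of the Bernoulli model:

$$\min_{\beta}\; -\sum_{i=1}^{n}\left[\, y_i \log \sigma(\beta^{T}x_i) + (1 - y_i)\log\bigl(1 - \sigma(\beta^{T}x_i)\bigr) \right], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Maximizing the likelihood (R) and minimizing this loss (Spark) yield the same optimum; any gap in accuracy comes from how well the optimizer actually reaches it.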

All linear models in ML/MLlib are based on some variant of gradient descent. The quality of a model generated with this approach varies on a case-by-case basis and depends on the gradient descent and regularization parameters.
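For illustration, the older RDD-based MLlib API exposes these optimizer knobs directly. A sketch, assuming a prepared RDD[LabeledPoint] named trainingData (all parameter values are placeholders, not recommendations):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val algo = new LogisticRegressionWithSGD()
algo.optimizer.setNumIterations(1000)    // number of gradient descent steps
algo.optimizer.setStepSize(0.1)          // learning rate
algo.optimizer.setMiniBatchFraction(1.0) // fraction of data used per step
algo.optimizer.setRegParam(0.01)         // regularization strength
val mllibModel = algo.run(trainingData)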

R, on the other hand, computes an exact solution (via iteratively reweighted least squares) which, given its time complexity, is not well suited for large datasets.

As mentioned above, the quality of a model trained with gradient descent depends on the input parameters, so the typical way to improve it is to perform hyperparameter optimization. Unfortunately, the ML version is rather limited here compared to MLlib, but for starters you can increase the number of iterations.
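A minimal sketch of such a search with spark.ml's tuning utilities (the grid values below are illustrative only):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Candidate hyperparameter values to try
val grid = new ParamGridBuilder().
  addGrid(lr.maxIter, Array(100, 1000, 10000)).
  addGrid(lr.regParam, Array(0.0, 0.01, 0.1)).
  addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).
  build()

// 3-fold cross-validation over the grid, scored by area under the ROC curve
val cv = new CrossValidator().
  setEstimator(lr).
  setEvaluator(new BinaryClassificationEvaluator()).
  setEstimatorParamMaps(grid).
  setNumFolds(3)

// model_input must already contain "features" and "label" columns
val bestModel = cv.fit(model_input).bestModel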
