Spark - MLlib linear regression intercept and weight NaN
Question
I have been trying to build a regression model on Spark using some custom data, and the intercept and weights always come out as NaN.

This is my data:
data = [LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]), LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])]
Output:
(weights=[nan], intercept=nan)
However, if I use this dataset (taken from the Spark examples), it returns a non-NaN weight and intercept.
data = [LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0]), LabeledPoint(3.0, [2.0]),LabeledPoint(2.0, [3.0])]
Output:
(weights=[0.798729902914], intercept=0.3027117101297481)
This is my current code:
model = LinearRegressionWithSGD.train(sc.parallelize(data), intercept=True)
Am I missing something? Is it because the numbers in my data are that big? This is my first time using MLlib, so I might be missing some details.
Thanks
Answer
MLlib linear regression is SGD-based, so you need to tweak the number of iterations and the step size; see https://spark.apache.org/docs/latest/mllib-optimization.html. With feature values around 27000, the default step size of 1.0 makes each gradient update overshoot, so the weight grows without bound and ends up as NaN.
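To see the divergence mechanism concretely, here is a minimal plain-Python sketch (not MLlib code) of full-batch gradient descent on the asker's four points; the `sgd` helper and the specific tiny step size are illustrative choices, not Spark APIs:

```python
import math

# The asker's data: labels 0..3, one large-valued feature.
xs = [27022.0, 27077.0, 27327.0, 27127.0]
ys = [0.0, 1.0, 2.0, 3.0]

def sgd(step, iters):
    """Gradient descent on mean squared error for the model y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= step * dw
        b -= step * db
    return w, b

w_big, _ = sgd(step=1.0, iters=100)    # step comparable to MLlib's default
w_small, _ = sgd(step=1e-10, iters=100)

print(math.isfinite(w_big))    # False: the weight oscillates and blows up to NaN
print(math.isfinite(w_small))  # True: a tiny step keeps the iterates finite
```

Each update multiplies the weight error by roughly `1 - step * 2/n * sum(x**2)`; with x around 27000 that factor is about -1.5e9 at step 1.0, so the weight's magnitude explodes by nine orders of magnitude per iteration until it overflows.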
I tried your custom data like this, and I got some results (in Scala):
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// `data` is the same Seq of LabeledPoints as in the question
val numIterations = 20
val model = LinearRegressionWithSGD.train(sc.parallelize(data), numIterations)
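An alternative to hand-tuning the step size is to standardize the feature first (in Spark this is what MLlib's StandardScaler is for). A plain-Python sketch of the same idea, reusing ordinary gradient descent on the asker's points; the 0.1 step and 500 iterations are illustrative assumptions:

```python
# Standardize the feature, fit y = w*z + b, then map back to the raw scale.
xs = [27022.0, 27077.0, 27327.0, 27127.0]
ys = [0.0, 1.0, 2.0, 3.0]
n = len(xs)

mean = sum(xs) / n
std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
zs = [(x - mean) / std for x in xs]  # zero mean, unit variance

w, b, step = 0.0, 0.0, 0.1
for _ in range(500):
    dw = sum(2 * (w * z + b - y) * z for z, y in zip(zs, ys)) / n
    db = sum(2 * (w * z + b - y) for z, y in zip(zs, ys)) / n
    w -= step * dw
    b -= step * db

# Coefficients on the original scale: y ≈ w_raw * x + b_raw
w_raw = w / std
b_raw = b - w * mean / std
```

Because the standardized feature has unit variance, an ordinary step size like 0.1 converges quickly: b goes to the mean label (1.5) and w to the least-squares slope, with no NaN.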