Spark mllib predicting weird number or NaN
Question
I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here is my code:
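from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    split = list(map(sanitize, line.split(',')))
    # The second-to-last column is the label; the remaining columns are the features.
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    # Strip the surrounding quotes and convert the field to a float.
    return float(value.strip('"'))

parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print(model.predict(parsedData.first().features))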
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
Recommended answer
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize, which is used to update the intermediate solution.
What SGD does is calculate the gradient g of the cost function given a sample of the input points and the current weights w. To update the weights w, you move a certain distance in the opposite direction of g. That distance is your step size s.
w(i+1) = w(i) - s * g
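To make the update rule concrete, here is a minimal sketch in plain Python (no Spark). It is not MLlib's implementation: it uses the full gradient instead of a random sample so the result is deterministic, it fits a single weight without an intercept, and the names gradient and descend are made up for illustration. The x and y values are the last and third columns of the first three sample rows in the question.

```python
# Illustrative data: last column (feature) and third column (label)
# of the first three sample rows shown in the question.
xs = [5330569.0, 5946290.0, 6097388.0]
ys = [41401.387, 51517.886, 55059.838]


def gradient(w, xs, ys):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def descend(step_size, iterations=10, w=0.0):
    """Apply w(i+1) = w(i) - s * g repeatedly and return the final weight."""
    for _ in range(iterations):
        g = gradient(w, xs, ys)
        w = w - step_size * g
    return w


# Step size 1: the weight overshoots further on every iteration.
print(descend(1.0, iterations=10))
# Step size 1 with more iterations: the weight overflows and turns into nan.
print(descend(1.0, iterations=100))
# A much smaller step size stays stable for this data.
print(descend(1e-14, iterations=10))
```

In this sketch the features are in the millions, so the gradient is enormous and a step size of 1 makes the weight grow by many orders of magnitude per iteration until it overflows. How small the step has to be depends on the scale of the features, so treat 1e-14 as a value that happens to work for this toy data, not as a recommendation.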
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
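One way to "try different step sizes" is to train once per candidate and compare the training error. The following is a rough sketch in plain Python rather than Spark, so it can be run anywhere; the toy data (y = 2x), the helper names train, mse and sweep, and the list of candidate step sizes are all assumptions made for illustration. With MLlib you would replace train with a call to LinearRegressionWithSGD.train and compute the error from the model's predictions.

```python
def train(xs, ys, step_size, iterations=10):
    """Fit a single weight w by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        g = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step_size * g
    return w


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given points."""
    total = 0.0
    for x, y in zip(xs, ys):
        d = w * x - y
        total += d * d
    return total / len(xs)


def sweep(xs, ys, step_sizes, iterations=10):
    """Return a dict mapping each step size to the training error it reaches."""
    results = {}
    for s in step_sizes:
        err = mse(train(xs, ys, s, iterations), xs, ys)
        if err != err:  # nan means the run diverged; rank it last
            err = float("inf")
        results[s] = err
    return results


# Hypothetical toy data following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

errors = sweep(xs, ys, [1.0, 0.1, 0.01, 0.001])
best = min(errors, key=errors.get)
for s in sorted(errors, reverse=True):
    print(s, errors[s])
print("best step size:", best)
```

On this toy data the largest step diverges and the smallest ones have not converged after 10 iterations, which is why it is worth checking several values instead of only going lower.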