Spark mllib predicting weird number or NaN

Question

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here is my code:

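from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # Build a list (not a lazy map object) so that pop() also works on Python 3
    split = [sanitize(v) for v in line.split(',')]
    rev = split.pop(-2)  # the second-to-last column is used as the label
    return LabeledPoint(rev, split)

def sanitize(value):
    # Strip the surrounding quotes and convert the field to a float
    return float(value.strip('"'))

# textFile is assumed to be an RDD holding the CSV lines shown above
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)

print(model.predict(parsedData.first().features))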

The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?

Recommended answer

The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.

What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.

w(i+1) = w(i) - s * g   
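
To make the effect of the step size concrete, here is a minimal pure-Python sketch of the update rule above. It is not MLlib's implementation: the toy data (y = 2 * x), the mean-squared-error cost, and the helper names gradient and descend are assumptions made for illustration, and it uses the full batch rather than a random sample so that the result is deterministic.

```python
# Toy data (assumed for illustration): y = 2 * x, with feature values around 100.
xs = [100.0, 200.0, 300.0]
ys = [2.0 * x for x in xs]

def gradient(w):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def descend(step, iterations=10, w=0.0):
    """Apply w(i+1) = w(i) - s * g for a fixed number of iterations."""
    for _ in range(iterations):
        w = w - step * gradient(w)
    return w

w_big = descend(step=1.0)     # every update overshoots further than the last
w_small = descend(step=1e-5)  # settles close to the true weight 2.0
print(w_big, w_small)
```

With a step of 1 the weight grows in magnitude on every update, which is the same kind of blow-up as the huge prediction in the question; with the smaller step the same ten iterations land near the true weight.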

Since you're not providing an explicit step size, MLlib falls back to its default of 1 (the parameter is called stepSize in the Scala API and step in PySpark's train()). That does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:

LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
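
Trying several step sizes can be automated with a small sweep. The sketch below is only an illustration: pick_step, toy_train and toy_error are hypothetical names, and the toy trainer stands in for Spark so the example runs on its own. With MLlib, the train callable would presumably wrap a call to LinearRegressionWithSGD.train with the candidate step, and the error callable would compute something like the mean squared error of the returned model.

```python
import math

def pick_step(train, error, candidates):
    """Return (step, error) for the candidate with the lowest finite error."""
    best = None
    for step in candidates:
        err = error(train(step))
        # Skip runs that diverged to inf or nan
        if math.isfinite(err) and (best is None or err < best[1]):
            best = (step, err)
    return best

# Stand-in data and trainer (hypothetical): gradient descent on y = 2 * x.
xs = [100.0, 200.0, 300.0]
ys = [2.0 * x for x in xs]

def toy_train(step, iterations=10):
    """Run a fixed number of gradient-descent updates and return the weight."""
    w = 0.0
    for _ in range(iterations):
        g = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= step * g
    return w

def toy_error(w):
    """Mean squared error of the weight w on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

best_step, best_err = pick_step(toy_train, toy_error, [1.0, 1e-3, 1e-5, 1e-7])
print(best_step, best_err)
```

On this toy problem the two largest steps diverge and the smallest barely moves in ten iterations, so the sweep settles on the value in between. The right range for real data depends on the scale of the features.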
