Spark mllib predicting weird number or NaN


Question

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here is my code:

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # Strip the quotes from every field and convert it to float
    split = map(sanitize, line.split(','))
    # Use the second-to-last column as the label and the rest as features
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    return float(value.strip('"'))

# textFile holds the CSV lines (e.g. loaded with sc.textFile)
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)

print model.predict(parsedData.first().features)

The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?

Answer

The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize, which is used to update the intermediate solution.

What SGD does is calculate the gradient g of the cost function from a sample of the input points and the current weights w. To update the weights w, you move a certain distance in the direction opposite to g. That distance is your step size s.

w(i+1) = w(i) - s * g   
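
A minimal NumPy sketch (not MLlib's internals) of this update on two of the posted rows makes the sensitivity concrete: with raw feature values in the millions, a step size of 1 blows the weights up, while a much smaller step keeps the update finite. The sgd_epoch helper and the two step values here are purely illustrative.

import numpy as np

# Illustrative only: one SGD pass with squared-error loss over two of the
# question's rows, applying the update w(i+1) = w(i) - s * g from above.
X = np.array([[365.0, 4.0, 5330569.0],
              [364.0, 3.0, 5946290.0]])   # features as parsePoint builds them
y = np.array([41401.387, 51517.886])      # labels (the popped column)

def sgd_epoch(w, s):
    for x_i, y_i in zip(X, y):
        g = (np.dot(w, x_i) - y_i) * x_i  # gradient of 0.5 * (w.x - y)^2
        w = w - s * g                     # step of size s against the gradient
    return w

w0 = np.zeros(X.shape[1])
print(sgd_epoch(w0, s=1.0))      # features ~5e6 with s = 1: the weights explode
print(sgd_epoch(w0, s=1e-14))    # a tiny step keeps the numbers finite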

Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This seems not to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:

LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
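
Note that in the Python API the keyword arguments are iterations and step, while the Scala API calls them numIterations and stepSize. A quick way to compare candidates is to train one model per step size and look at the training error, following the standard MLlib evaluation pattern; the step-size grid below is only an illustrative guess, and it assumes parsedData from the question is already built.

from pyspark.mllib.regression import LinearRegressionWithSGD

# Illustrative sweep over a few step sizes (the grid is a guess, not a recommendation)
for s in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    model = LinearRegressionWithSGD.train(parsedData, iterations=10, step=s)
    # Training MSE, as in the standard MLlib regression example
    labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
    mse = labelsAndPreds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
    print("step=%g  training MSE=%g" % (s, mse))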
