Why is my Spark SVM always predicting the same label?
Problem description
I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. I'm wondering if one of you could tell me what I'm doing wrong.
I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't understand the answers that I've come across.
So I start off with this, and it more or less works:
from pyspark.mllib.regression import LabeledPoint
cooked_rdd = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [1])])
from pyspark.mllib.classification import SVMWithSGD
model = SVMWithSGD.train(cooked_rdd)
I say "more or less" because
model.predict([0])
Out[47]: 0
is what I would expect, and...
model.predict([1])
Out[48]: 1
is also what I would expect, but...
model.predict([0.000001])
Out[49]: 1
is definitely not what I expected. I think that whatever is causing that is at the root of my problems.
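For intuition (a minimal plain-Python sketch, not MLlib's actual internals; the weight `w`, bias `b`, and threshold values are hypothetical): a linear SVM predicts by comparing the margin w·x + b against a threshold, so the cutoff between 0 and 1 lands wherever training put it, which can be arbitrarily close to one of the training points.

```python
# Illustrative sketch of a linear SVM's decision rule. The weight, bias,
# and threshold here are made up, not values MLlib actually learned.
def svm_predict(x, w=1.0, b=0.0, threshold=0.0):
    margin = w * x + b
    # The label flips as soon as the margin crosses the threshold.
    return 1 if margin > threshold else 0

print(svm_predict(0.0))       # 0: the margin sits exactly on the threshold
print(svm_predict(0.000001))  # 1: any positive margin crosses it
```

With these hypothetical values the boundary sits at x = 0, so even 0.000001 lands on the 1 side, mirroring the surprising prediction above.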
Here I start by cooking my data...
from random import random

def cook_data():
    x = random()
    y = random()
    dice = 0.25 + (random() * 0.5)
    if x**2 + y**2 > dice:
        category = 0
    else:
        category = 1
    return LabeledPoint(category, [x, y])

cooked_data = []
for i in range(0, 5000):
    cooked_data.append(cook_data())
... and I get a beautiful cloud of points. When I plot them I get a division with a little bit of a muddled area, but any kindergartner could draw a line to separate them. So why is that when I try drawing a line to separate them...
cooked_rdd = sc.parallelize(cooked_data)
training, testing = cooked_rdd.randomSplit([0.9, 0.1], seed = 1)
model = SVMWithSGD.train(training)
prediction_and_label = testing.map(lambda p : (model.predict(p.features), p.label))
...I can only lump them into one group, and not two? (Below is a list that shows tuples of what the SVM predicted, and what the answer should have been.)
prediction_and_label.collect()
Out[54]:
[(0, 1.0),
(0, 0.0),
(0, 0.0),
(0, 1.0),
(0, 0.0),
(0, 0.0),
(0, 1.0),
(0, 0.0),
(0, 1.0),
(0, 1.0),
...
And so on. It only ever guesses 0, when there should be a pretty obvious division where it should start guessing 1. Can anyone tell me what I'm doing wrong? Thanks for your help.
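One way to quantify this (a sketch in plain Python; a short hard-coded list of tuples stands in for the real `prediction_and_label` RDD):

```python
# A few (prediction, label) tuples like the output above, hard-coded
# so the arithmetic is easy to follow without a Spark context.
pairs = [(0, 1.0), (0, 0.0), (0, 0.0), (0, 1.0), (0, 0.0)]

errors = sum(1 for pred, label in pairs if pred != label)
error_rate = errors / len(pairs)
print(error_rate)  # 0.4 -- and every error is a missed 1
```

On the real RDD the equivalent count is `prediction_and_label.filter(lambda pl: pl[0] != pl[1]).count() / float(testing.count())`.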
I don't think it's a problem with scale, as was suggested in some other posts with similar problems. I've tried multiplying everything by 100, and I still get the same problem. I also tried playing with how I calculate my "dice" variable, but all I could do was change the SVM's guesses from all 0's to all 1's.
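A cheap sanity check on the data itself (plain Python, no Spark, repeating the same sampling logic as `cook_data` but returning only the category) confirms both classes are well represented, so the single-label output isn't just class imbalance:

```python
from random import random, seed

# Same sampling logic as cook_data, minus the LabeledPoint wrapper.
def cook_category():
    x = random()
    y = random()
    dice = 0.25 + (random() * 0.5)
    return 0 if x**2 + y**2 > dice else 1

seed(1)  # fixed seed so the check is repeatable
categories = [cook_category() for _ in range(5000)]
print(categories.count(0), categories.count(1))  # both counts are in the thousands
```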
Recommended answer
I figured out why it's always predicting either all 1's or all 0's. I need to add this line:
model.setThreshold(0.5)
That fixes it. I figured it out after using
model.clearThreshold()
followed by predicting the test data. That told me what the computer was predicting down to a floating-point value, not just the binary 0 or 1 I'm ultimately looking for. I could see that the SVM was making what I considered a counterintuitive rounding decision. By using setThreshold, I'm now able to get much better results.
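The mechanics, sketched in plain Python with made-up margin values (in MLlib, `model.clearThreshold()` makes `predict` return the raw score as a float, and `model.setThreshold(t)` restores binary output at cutoff `t`):

```python
# Hypothetical raw scores, like what predict returns after clearThreshold().
raw_scores = [0.12, 0.48, 0.51, 0.73, 0.95]

def classify(scores, threshold):
    # Binary output: 1 whenever the raw score exceeds the threshold.
    return [1 if s > threshold else 0 for s in scores]

print(classify(raw_scores, 0.0))  # [1, 1, 1, 1, 1] -- everything rounds to 1
print(classify(raw_scores, 0.5))  # [0, 0, 1, 1, 1] -- a mix, as hoped for
```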