Why is my Spark SVM always predicting the same label?
Question
I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. I'm wondering if one of you could tell me what I'm doing wrong.
I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't understand the answers that I've come across.
So I start off with this, and it more or less works:
from pyspark.mllib.regression import LabeledPoint
cooked_rdd = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [1])])
from pyspark.mllib.classification import SVMWithSGD
model = SVMWithSGD.train(cooked_rdd)
I say "more or less" because
model.predict([0])
Out[47]: 0
is what I would expect, and...
model.predict([1])
Out[48]: 1
is also what I would expect, but...
model.predict([0.000001])
Out[49]: 1
is definitely not what I expected. I think that whatever is causing that is at the root of my problems.
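For a linear SVM the prediction is just the sign of a weighted sum, so a tiny positive input can land on the 1 side of the boundary. Here's a minimal plain-Python sketch (not the MLlib internals; the weight and intercept are hypothetical) of how predict([0.000001]) can come out as 1:

```python
# Sketch of a linear SVM decision rule: predict 1 when the margin
# w.x + b is above the threshold, else 0. The weight and intercept
# below are hypothetical, not values MLlib actually learned.
def svm_predict(features, weights, intercept, threshold=0.0):
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1 if margin > threshold else 0

weights, intercept = [1.0], -1e-9  # hypothetical fitted parameters

print(svm_predict([0], weights, intercept))         # 0
print(svm_predict([1], weights, intercept))         # 1
print(svm_predict([0.000001], weights, intercept))  # 1 -- any x above 1e-9 crosses the boundary
```

With a threshold of 0, anything even slightly on the positive side of the hyperplane gets labeled 1, which is consistent with the surprising result above.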
Here I start by cooking my data...
from random import random

def cook_data():
    x = random()
    y = random()
    dice = 0.25 + (random() * 0.5)
    if x**2 + y**2 > dice:
        category = 0
    else:
        category = 1
    return LabeledPoint(category, [x, y])

cooked_data = []
for i in range(0, 5000):
    cooked_data.append(cook_data())
... and I get a beautiful cloud of points. When I plot them I get a division with a little bit of a muddled area, but any kindergartner could draw a line to separate them. So why is that when I try drawing a line to separate them...
cooked_rdd = sc.parallelize(cooked_data)
training, testing = cooked_rdd.randomSplit([0.9, 0.1], seed = 1)
model = SVMWithSGD.train(training)
prediction_and_label = testing.map(lambda p: (model.predict(p.features), p.label))
...I can only lump them into one group, and not two? (Below is a list that shows tuples of what the SVM predicted, and what the answer should have been.)
prediction_and_label.collect()
Out[54]:
[(0, 1.0),
(0, 0.0),
(0, 0.0),
(0, 1.0),
(0, 0.0),
(0, 0.0),
(0, 1.0),
(0, 0.0),
(0, 1.0),
(0, 1.0),
...
And so on. It only ever guesses 0, when there should be a pretty obvious division where it should start guessing 1. Can anyone tell me what I'm doing wrong? Thanks for your help.
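Since prediction_and_label is just (prediction, label) pairs, it's easy to quantify how bad this is. A plain-Python sketch over the ten collected pairs shown above (no Spark needed):

```python
# Accuracy over the collected (prediction, label) pairs listed above.
pairs = [(0, 1.0), (0, 0.0), (0, 0.0), (0, 1.0), (0, 0.0),
         (0, 0.0), (0, 1.0), (0, 0.0), (0, 1.0), (0, 1.0)]
correct = sum(1 for pred, label in pairs if pred == label)
accuracy = correct / len(pairs)
print(accuracy)  # 0.5 -- the model is right only on the 0-labeled points
```

On the RDD itself the same count can be done without collecting, e.g. prediction_and_label.filter(lambda pl: pl[0] == pl[1]).count() divided by testing.count().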
I don't think it's a problem with scale, as was suggested in some other posts with similar problems. I've tried multiplying everything by 100, and I still get the same problem. I also tried playing with how I calculate my "dice" variable, but all I can do is change the SVM's guesses from all 0's to all 1's.
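One more sanity check worth doing (plain Python, no Spark; cook_point mirrors the cook_data function above but returns a bare label): confirm the generated data really does contain both classes, so the all-0 output can't be blamed on the data itself.

```python
from random import random, seed

def cook_point():
    # Same labeling rule as cook_data above, minus the LabeledPoint wrapper.
    x, y = random(), random()
    dice = 0.25 + (random() * 0.5)
    return 0 if x**2 + y**2 > dice else 1

seed(1)  # fixed seed so the check is repeatable
labels = [cook_point() for _ in range(5000)]
print(labels.count(0), labels.count(1))  # both classes are well represented
```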
Answer
I figured out why it's always predicting either all 1's or all 0's. I need to add this line:
model.setThreshold(0.5)
That fixes it. I figured it out after using
model.clearThreshold()
clearThreshold, followed by predicting the test data. That showed me the raw floating-point scores the model was producing, rather than just the binary 0 or 1 I'm ultimately looking for. I could see that the SVM was making what I considered a counterintuitive rounding decision. By using setThreshold, I'm now able to get much better results.
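The effect can be sketched in plain Python: after clearThreshold() the model hands back raw scores, and setThreshold(t) is what maps a score to 0 or 1. The scores below are made-up examples, not actual model output:

```python
# Thresholding sketch: raw score -> binary label, as setThreshold does.
def apply_threshold(raw_score, threshold):
    return 1 if raw_score > threshold else 0

raw_scores = [0.37, 0.62, 0.48, 1.05]  # hypothetical raw margins
labels = [apply_threshold(s, 0.5) for s in raw_scores]
print(labels)  # [0, 1, 0, 1]
```

Inspecting the raw scores this way makes it obvious where the cutoff should sit, instead of guessing why every prediction collapses to the same label.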