Why is my Spark SVM always predicting the same label?


Problem description

I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. It seems that after I train it and give it more data, it always wants to predict a 1 or a 0, but it will predict all 1's or all 0's, and never a mix of the two. I'm wondering if one of you could tell me what I'm doing wrong.

I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't understand the answers that I've come across.

So I start off with this, and it more or less works:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import SVMWithSGD

# Two training points: label 0 at feature [0], label 1 at feature [1].
cooked_rdd = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [1])])
model = SVMWithSGD.train(cooked_rdd)

I say "more or less" because

model.predict([0])
Out[47]: 0

is what I would expect, and...

model.predict([1])
Out[48]: 1

is also what I would expect, but...

model.predict([0.000001])
Out[49]: 1

is definitely not what I expected. I think that whatever is causing that is at the root of my problems.
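One way to dig into that (a minimal sketch, assuming the toy model above is still in scope) is to look at the raw margin w·x + b that predict() compares against the model's threshold, via the model's weights and intercept attributes:

# The learned hyperplane: predict() computes w.dot(x) + b and compares
# it to the model's threshold (0.0 by default for SVMModel, as far as I
# can tell from the mllib source).
print(model.weights)    # a single learned weight here
print(model.intercept)  # 0.0 unless trained with intercept=True

# Any strictly positive margin clears a 0.0 threshold, so even a tiny
# input like [0.000001] falls on the "1" side of the line.
margin = model.weights[0] * 0.000001 + model.intercept
print(margin, model.predict([0.000001]))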

Here I start by cooking my data...

from random import random

def cook_data():
  # Pick a random point in the unit square and a per-point cutoff
  # ("dice") between 0.25 and 0.75; points with x**2 + y**2 above the
  # cutoff get label 0, the rest get label 1.
  x = random()
  y = random()
  dice = 0.25 + (random() * 0.5)
  if x**2 + y**2 > dice:
    category = 0
  else:
    category = 1
  return LabeledPoint(category, [x, y])

cooked_data = []
for i in range(5000):
  cooked_data.append(cook_data())

... and I get a beautiful cloud of points. When I plot them I get a division with a little bit of a muddled area, but any kindergartner could draw a line to separate them (a plotting sketch follows).
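For reference, a minimal matplotlib sketch that reproduces that plot (it assumes the cooked_data list built above and that matplotlib is installed):

import matplotlib.pyplot as plt

# Split the generated points by label so the two classes get
# different colors; p.features is [x, y].
xs0 = [p.features[0] for p in cooked_data if p.label == 0]
ys0 = [p.features[1] for p in cooked_data if p.label == 0]
xs1 = [p.features[0] for p in cooked_data if p.label == 1]
ys1 = [p.features[1] for p in cooked_data if p.label == 1]

plt.scatter(xs0, ys0, s=2, label="category 0")
plt.scatter(xs1, ys1, s=2, label="category 1")
plt.legend()
plt.show()

So why is it that, when I try drawing a line to separate them...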

cooked_rdd = sc.parallelize(cooked_data)
# Hold out 10% of the points for testing.
training, testing = cooked_rdd.randomSplit([0.9, 0.1], seed = 1)
model = SVMWithSGD.train(training)
prediction_and_label = testing.map(lambda p: (model.predict(p.features), p.label))

...I can only lump them into one group, and not two? (Below is a list that shows tuples of what the SVM predicted, and what the answer should have been.)

prediction_and_label.collect()
Out[54]: 
[(0, 1.0),
 (0, 0.0),
 (0, 0.0),
 (0, 1.0),
 (0, 0.0),
 (0, 0.0),
 (0, 1.0),
 (0, 0.0),
 (0, 1.0),
 (0, 1.0),
...

And so on. It only ever guesses 0, when there should be a pretty obvious division where it should start guessing 1. Can anyone tell me what I'm doing wrong? Thanks for your help.
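To put a number on "it only ever guesses 0", the test error rate can be computed from the same RDDs (a small sketch reusing testing and prediction_and_label from above):

# Count the test points where the prediction disagrees with the label.
errors = prediction_and_label.filter(lambda pl: pl[0] != pl[1]).count()
print("test error rate: %f" % (errors / float(testing.count())))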

I don't think it's a problem with scale, as was suggested in some other posts with similar problems. I've tried multiplying everything by 100, and I still get the same problem. I also tried playing with how I calculate my "dice" variable, but all I can do is change the SVM's guesses from all 0's to all 1's.
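For completeness, this is the kind of rescaling that can be tried with mllib's StandardScaler (a minimal sketch following the pattern in the Spark docs, assuming the training RDD from above; as noted, rescaling didn't change the outcome here):

from pyspark.mllib.feature import StandardScaler

# Standardize features to zero mean and unit variance.
# withMean=True requires dense vectors, which these points use.
labels = training.map(lambda p: p.label)
features = training.map(lambda p: p.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled_training = labels.zip(scaler.transform(features)) \
                        .map(lambda lf: LabeledPoint(lf[0], lf[1]))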

Recommended answer

I figured out why it's always predicting either all 1's or all 0's. I needed to add this line:

model.setThreshold(0.5)

That fixes it. I figured it out after using

model.clearThreshold()

Calling clearThreshold and then predicting the test data showed me the raw floating-point score the model produces, rather than just the binary 0 or 1 I'm ultimately looking for. I could see that the SVM was making what I considered a counterintuitive rounding decision. By using setThreshold, I'm now able to get much better results.
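Putting the pieces together, the inspect-then-fix sequence looks roughly like this (a sketch reusing model and testing from the question):

# With the threshold cleared, predict() returns the raw score
# instead of a 0/1 label, which exposes the rounding decision.
model.clearThreshold()
raw_scores = testing.map(lambda p: (model.predict(p.features), p.label))
print(raw_scores.take(10))

# Restore binary output, cutting the raw score at 0.5 instead.
model.setThreshold(0.5)
prediction_and_label = testing.map(lambda p: (model.predict(p.features), p.label))
print(prediction_and_label.take(10))

Whether 0.5 is the right cutoff presumably depends on the data; the raw scores exposed by clearThreshold are what to look at when choosing it.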
