Scikit-Learn: Label not x is present in all training examples

Question

I'm trying to do multilabel classification with an SVM. I have nearly 8k features, and each y vector has a length of nearly 400. My Y vectors are already binarized, so I didn't use MultiLabelBinarizer(), but when I use it on the raw form of my Y data, I get the same result.

I'm running the following code:

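import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Load the feature matrix and the binarized label matrix (semicolon-separated).
X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")

# Use the first 2600 rows for training.
training_X = X[:2600, :]
training_y = Y[:2600, :]

# Hold out a single row to check the prediction against its true labels.
test_sample = X[2600:2601, :]
test_result = Y[2600:2601, :]

# Fit one binary RBF-kernel SVM per label.
classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)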

After fitting, when it gets to the prediction step, it says Label not x is present in all training examples (x is one of a few different numbers within the range of my y vector length, which is 400). After that, it gives the predicted y vector, which is always a zero vector of length 400 (the y vector length). I'm new to scikit-learn and to machine learning, and I couldn't figure out the problem here. What's the problem, and what should I do to fix it? Thanks.

Recommended answer

There are two problems here:

1) The missing label warning
2) You are getting all 0's for predictions

The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
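The check above can be sketched as a small helper. This is only an illustration on a toy indicator matrix: the function name missing_labels and the data are made up here, and your own Y would come from data_y as in the question.

```python
import numpy as np

def missing_labels(Y):
    """Return the column indices of labels that have no positive example in Y."""
    Y = np.asarray(Y)
    return np.flatnonzero(Y.sum(axis=0) == 0)

# Toy indicator matrix (hypothetical data): 4 samples, 5 labels.
Y_toy = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
])

# False means at least one label never occurs anywhere in the data.
print(Y_toy.sum(axis=0).all())

# Labels absent from the whole data set.
print(missing_labels(Y_toy))

# Labels absent from a "training" split made of the first two rows only.
print(missing_labels(Y_toy[:2]))
```

Running the helper on the training slice (Y[:2600] in the question) rather than on all of Y should list the label indices that the warning complains about.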

As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
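To see how rare the labels are, the frequency check can be run roughly like this. The matrix below is random toy data standing in for the real Y, and the 2% label rate is an assumption chosen only for illustration.

```python
import numpy as np

def label_frequencies(Y):
    """Fraction of samples in which each label is positive (one value per column)."""
    return np.asarray(Y, dtype=float).mean(axis=0)

# Hypothetical stand-in for Y: 1000 samples, 400 labels, each positive about 2% of the time.
rng = np.random.default_rng(0)
Y_toy = (rng.random((1000, 400)) < 0.02).astype(int)

freq = label_frequencies(Y_toy)
print("highest label frequency:", freq.max())
print("every label occurs in under half the samples:", bool((freq < 0.5).all()))
```

If the highest frequency is only a few percent, a per-label binary classifier predicting 0 everywhere is the expected behaviour rather than a bug.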

If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
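A minimal sketch of that idea follows. It works on a plain array of decision values, so the scores below are made up; with the code from the question they would presumably come from classif.decision_function(test_sample), which returns one column per label. The helper name predict_top_label is hypothetical.

```python
import numpy as np

def predict_top_label(scores):
    """Mark exactly one label per row: the one with the highest decision value."""
    scores = np.asarray(scores)
    Y_pred = np.zeros(scores.shape, dtype=int)
    rows = np.arange(scores.shape[0])
    Y_pred[rows, scores.argmax(axis=1)] = 1
    return Y_pred

# Made-up decision values for 2 samples and 3 labels. All are negative,
# so a plain predict() would return all zeros for both rows.
scores = np.array([
    [-1.2, -0.3, -2.0],
    [-0.9, -1.5, -0.1],
])
print(predict_top_label(scores))
```

This guarantees one positive label per instance, which only makes sense if you know every instance really has at least one.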

I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
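The strategy described above (the top-scoring label plus everything over a threshold, with the threshold tuned on a holdout set) might look roughly like the sketch below. This is not the code from that contest entry: the function names, the candidate thresholds, the toy scores, and the choice of micro-averaged F1 as the selection metric are all assumptions made for illustration.

```python
import numpy as np

def predict_top_plus_threshold(scores, threshold):
    """Predict every label scoring above threshold, plus the top-scoring label per row."""
    scores = np.asarray(scores)
    Y_pred = (scores > threshold).astype(int)
    rows = np.arange(scores.shape[0])
    Y_pred[rows, scores.argmax(axis=1)] = 1  # always keep the best label
    return Y_pred

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1 over all (sample, label) cells; an assumed metric."""
    Y_true = np.asarray(Y_true)
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pick_threshold(scores, Y_true, candidates):
    """Return the candidate threshold with the best holdout micro-F1."""
    return max(
        candidates,
        key=lambda t: micro_f1(Y_true, predict_top_plus_threshold(scores, t)),
    )

# Made-up holdout decision values and true labels (3 samples, 3 labels).
holdout_scores = np.array([
    [ 0.4, -0.3, -1.0],
    [-0.8, -0.6, -0.7],
    [-0.1,  0.3, -0.9],
])
holdout_Y = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
])

best = pick_threshold(holdout_scores, holdout_Y, [-0.5, -0.25, 0.0, 0.25])
print("chosen threshold:", best)
print(predict_top_plus_threshold(holdout_scores, best))
```

In practice the holdout scores would come from the fitted classifier's decision_function on rows kept out of training, and the metric should match whatever you are actually evaluated on.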

If you made it this far, thanks for reading. Hope that helps.
