对标签不在训练集中的测试数据使用MultilabelBinarizer [英] Using MultilabelBinarizer on test data with labels not in the training set

查看:74
本文介绍了对标签不在训练集中的测试数据使用MultilabelBinarizer的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出这个简单的多标签分类示例(来自此问题,使用scikit-learn分为多个类别)

Given this simple example of multilabel classification (taken from this question, use scikit-learn to classify into multiple categories)

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    ["new york"],
            ["new york"],["london"],["london"],["london"],["london"],
            ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["london"],["london"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]


lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)


print "Accuracy Score: ",accuracy_score(Y_test, predicted)

代码运行正常,并打印出准确性得分,但是如果我将y_test_text更改为

The code runs fine, and prints the accuracy score, however if I change y_test_text to

y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

我知道

Traceback (most recent call last):
  File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
     print "Accuracy Score: ",accuracy_score(Y_test, predicted)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
differing_labels = count_nonzero(y_true - y_pred, axis=1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

请注意,该标签中没有引入英国"标签.如何使用多标签分类,以便在引入测试"标签后仍可以运行某些指标?还是有可能?

Notice the introduction of the 'england' label which is not in the training set. How do I use multilabel classification so that if a "test" label is introduced, i can still run some some of metrics? Or is that even possible?

谢谢你们的回答,我想我的问题更多是关于scikit binarizer如何工作或应该如何工作.给定我的简短示例代码,我也希望将y_test_text更改为

Thanks for answers guys, I guess my question is more about how the scikit binarizer works or should work. Given my short sample code, i would also expect if i changed y_test_text to

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

那行得通-我的意思是我们已经为那个标签装好了,但在这种情况下,我得到了

That it would work--i mean we have fitted for that label, but in this case I get

ValueError: Can't handle mix of binary and multilabel-indicator

推荐答案

你可以,如果你也在训练 y 集中引入"新标签,像这样:

You can, if you "introduce" the new label in the training y set too, like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    
                ["new york"],["new york"],["london"],["london"],         
                ["london"],["london"],["london"],["london"],
                ["new york","England"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]


lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

print Y_test

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

输出:

Accuracy Score:  0.571428571429

关键部分是:

y_train_text = [["new york"],["new york"],["new york"],
                ["new york"],["new york"],["new york"],
                ["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","England"],
                ["new york","london"]]

我们也在其中插入了英格兰".这是有道理的,因为如果他以前没有看到分类器,则可以通过其他方式预测分类器吗?因此,我们以此方式创建了一个三标签分类问题.

Where we inserted "England" too. It makes sense, because other way how can predict the classifier some label if he didn't see it before? So we created a three label classification problem this way.

已编辑

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

您必须将类作为 arg 传递给 MultiLabelBinarizer(),它可以处理任何 y_test_text.

You have to pass the classes as arg to MultiLabelBinarizer() and it will work with any y_test_text.

这篇关于对标签不在训练集中的测试数据使用MultilabelBinarizer的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆