Inconsistent shape error with MultiLabelBinarizer on y_test (sklearn multi-label classification)
Problem description
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
data = r'C:\Users\...\Downloads\news_v1.xlsx'
df = pd.read_excel(data)
df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index()
X = np.array(df.doc)
y = np.array(df.label)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
mlb = preprocessing.MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y_train)
predicted = classifier.predict(X_test)
Y_test = mlb.fit_transform(y_test)
print("Y_train: ", Y_train.shape)
print("Y_test: ", Y_test.shape)
print("Predicted: ", predicted.shape)
print("Accuracy Score: ", accuracy_score(Y_test, predicted))
I can't seem to compute any metrics, because Y_test has a different matrix dimension after fit_transform with MultiLabelBinarizer.
Output and error:
Y_train: (1278, 49)
Y_test: (630, 42)
Predicted: (630, 49)
Traceback (most recent call last):
File "C:/Users/../PycharmProjects/MultiAutoTag/classifier.py", line 41, in <module>
print("Accuracy Score: ", accuracy_score(Y_test, predicted))
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 174, in accuracy_score
differing_labels = count_nonzero(y_true - y_pred, axis=1)
File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 361, in __sub__
raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes
Looking at the printed shapes, Y_test differs from the others. What am I doing wrong, and why does MultiLabelBinarizer return a different size for Y_test? Thanks in advance for the help!
Edit: new error:
Traceback (most recent call last):
File "C:/Users/../PycharmProjects/MultiAutoTag/classifier.py", line 47, in <module>
Y_test = mlb.transform(y_test)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 763, in transform
yt = self._transform(y, class_to_index)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 787, in _transform
indices.extend(set(class_mapping[label] for label in labels))
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 787, in <genexpr>
indices.extend(set(class_mapping[label] for label in labels))
KeyError: 'Sanction'
This is what y_test looks like:
print(y_test)
[['App'] ['Contract'] ['Pay'] ['App']
['App'] ['App']
['Reports'] ['Reports'] ['Executive', 'Pay']
['Change'] ['Reports']
['Reports'] ['Issue']]
Recommended answer
You should only call transform() on test data. Never call fit() or its variants such as fit_transform() or fit_predict() on it; those should be used only on training data.
So change the line:
Y_test = mlb.fit_transform(y_test)
to
Y_test = mlb.transform(y_test)
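As a minimal sketch of the fix, the snippet below fits the binarizer on training labels only and then reuses it on the test labels. The label lists are made-up toy data, not the asker's dataset; the point is only that both matrices end up with the same number of columns.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists, invented for illustration (not the asker's data).
y_train = [["App"], ["Contract", "Pay"], ["Reports"], ["Executive", "Pay"]]
y_test = [["App"], ["Reports"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the label set from the training data
Y_test = mlb.transform(y_test)        # reuses the columns learned above

print(Y_train.shape)  # (4, 5)
print(Y_test.shape)   # (2, 5)
```

Because the test labels are mapped onto the columns learned from the training labels, Y_test can be compared column by column with the classifier's predictions.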
Explanation:
When you call fit() or fit_transform(), mlb discards what it previously learned and learns the label set from the newly supplied data. This is a problem when Y_train and Y_test contain different labels, as they do in your case.
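The effect of refitting can be seen by inspecting the classes_ attribute before and after a second fit. This is a small illustration with invented labels, not the asker's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists, invented for illustration.
y_train = [["App"], ["Contract", "Pay"], ["Reports"]]
y_test = [["App"], ["Issue"]]

mlb = MultiLabelBinarizer()

mlb.fit(y_train)
train_classes = mlb.classes_.tolist()  # label set learned from the training data

mlb.fit(y_test)                        # refitting replaces the learned label set
test_classes = mlb.classes_.tolist()

print(train_classes)  # ['App', 'Contract', 'Pay', 'Reports']
print(test_classes)   # ['App', 'Issue']
```

After the second fit, the binarizer produces two columns instead of four, and its second column now means 'Issue' rather than 'Contract'.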
In your case, Y_train has 49 distinct labels, whereas Y_test has only 42. That does not mean Y_test is simply 7 labels short of Y_train. Y_test may contain a different set of labels that happens to binarize into 42 columns, so its columns would not line up with the 49 columns the classifier was trained on, and any score computed from them would be meaningless.
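The KeyError: 'Sanction' in the question's edit suggests that the test split contains a label that never appears in the training split. Depending on the scikit-learn version, transform() may raise an error like that or ignore the unknown label with a warning. One possible workaround, sketched here on invented toy data, is to binarize the full label column once before splitting, so both splits share the same columns. The variable names docs and labels are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy documents and labels, invented for illustration.
docs = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"]
labels = [["App"], ["Contract"], ["Pay"], ["Sanction"],
          ["Executive", "Pay"], ["Reports"]]

# Binarize once, using every label that occurs anywhere in the data.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Split the documents and the already binarized labels together.
X_train, X_test, Y_train, Y_test = train_test_split(
    docs, Y, test_size=0.33, random_state=42)

print(Y_train.shape[1], Y_test.shape[1])  # same number of columns in both
```

This only aligns the shapes. A label with no positive examples in the training split still cannot be learned by the classifier, so very rare labels may need more data or to be dropped.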