UserWarning: Label not :NUMBER: is present in all training examples


Problem Description


I am doing multilabel classification, where I try to predict correct labels for each document and here is my code:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

predicted = cross_val_predict(classifier, X, y)


When running my code I get multiple warnings:

UserWarning: Label not :NUMBER: is present in all training examples.

When I print out the predicted and true labels, roughly half of all documents have empty label predictions.

Why is this happening? Is it related to the warnings printed while training is running? How can I avoid those empty predictions?

EDIT01: This also happens when using estimators other than LinearSVC().

I've tried RandomForestClassifier() and it gives empty predictions as well. The strange thing is, when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict per-label probabilities instead of binary 0/1 decisions, there is always at least one label with probability > 0 in each predicted set for a given document. So why is this label not chosen by the binary decision? Or is the binary decision evaluated differently from the probabilities?

EDIT02: I have found an old post where the OP was dealing with a similar problem. Is this the same case?

Recommended Answer


Why is this happening? Is it related to the warnings printed while training is running?

The issue is likely that some tags occur in just a few documents (check out this thread for details). When you split the dataset into train and test sets to validate your model, some tags may be missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (of index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros.
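
To see this concretely, here is a minimal sketch (my own illustration, reusing mlb, X and y from above; train_test_split is just one way to produce such a split) that reports which tags vanish from a training split:

import numpy as np
from sklearn.model_selection import train_test_split

# Split the sample indices in half, much like cross-validation does internally
train_indices, test_indices = train_test_split(
    np.arange(len(X)), test_size=0.5, random_state=0)

# A label column that sums to zero never occurs in the training data
missing = np.flatnonzero(y[train_indices].sum(axis=0) == 0)
print('Tags absent from the training split:', mlb.classes_[missing])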


How can I avoid those empty predictions?

In the scenario described above the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict and you need to implement the prediction function on your own, for example by using the decision values returned by clf.decision_function as suggested in this answer.
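
For instance, here is one possible sketch of such a fallback (my own illustration, not the exact code from that answer; it assumes the fitted classifier exposes decision_function, which LinearSVC does): whenever predict() returns an all-zero row, force the highest-scoring label to 1 instead.

import numpy as np

def predict_with_fallback(clf, X):
    pred = clf.predict(X)
    scores = clf.decision_function(X)
    empty = pred.sum(axis=1) == 0   # rows where no tag was predicted
    # For each empty row, switch on the single highest-scoring label
    pred[np.flatnonzero(empty), np.argmax(scores[empty], axis=1)] = 1
    return pred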


So why is this label not chosen by the binary decision? Or is the binary decision evaluated differently from the probabilities?

In datasets containing many labels, the occurrence frequency of most of them tends to be rather low. If these low values are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction), it is highly likely that the classifier will pick 0 for all tags on all documents.
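
To make the mechanics concrete: with OneVsRestClassifier, predict() marks a label as 1 only where that label's decision value is positive, so for a rare label all decision values can be negative, yielding an all-zero row even though one label still scores highest. This also explains why predict_proba can show a positive probability for some label while the binary prediction stays empty. A tiny mock illustration (the score values below are made up):

import numpy as np

decision_values = np.array([-1.2, -0.3, -2.5])    # mock scores for 3 labels
binary_prediction = (decision_values > 0).astype(int)
print(binary_prediction)             # [0 0 0] -> an "empty" prediction
print(np.argmax(decision_values))    # 1 -> label 1 is still the best guess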


I have found an old post where the OP was dealing with a similar problem. Is this the same case?

Yes, absolutely. That guy is facing exactly the same problem as you and his code is pretty similar to yours.

Demo

To further illustrate the issue, I have put together a simple toy example using mock data.

import pandas as pd

Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
# list() is needed so pandas receives concrete sequences, not dict views
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

Please notice that I have set min_df=1 since my dataset is much smaller than yours. When I run the following statement:

predicted = cross_val_predict(classifier, X, y)

I get a bunch of warnings:

C:\...\multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

and the following predictions:

In [5]: np.set_printoptions(precision=2, threshold=1000)    

In [6]: predicted
Out[6]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

Those rows whose entries are all 0 indicate that no tag is predicted for the corresponding document.

Workaround

For the sake of the analysis, let us validate the model manually rather than through cross_val_predict.

import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
train_indices, test_indices = next(rs.split(X))

with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print(w.message)

When the snippet above is executed, two warnings are issued (I used a context manager to make sure the warnings are caught):

Label not 2 is present in all training examples.
Label not 4 is present in all training examples.

This is consistent with the fact that the tags of indices 2 and 4 are missing from the training samples:

In [40]: y_train
Out[40]: 
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1]])

For some documents the prediction is empty (the rows of all zeros in predicted_test):

In [42]: predicted_test
Out[42]: 
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])

To overcome that issue, you could implement your own prediction function like this:

import numpy as np

def get_best_tags(clf, X, lb, n_tags=3):
    decfun = clf.decision_function(X)
    # Take the indices of the n_tags highest decision values, best first
    best_tags = np.argsort(decfun)[:, :-(n_tags + 1):-1]
    return lb.classes_[best_tags]

By doing so, each document is always assigned the n_tags tags with the highest confidence scores:

In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]

In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]: 
array([['matlab', 'oop', 'matlab-oop'],
       ['oop', 'matlab-oop', 'matlab'],
       ['oop', 'matlab-oop', 'matlab'],
       ['matlab', 'naming-conventions', 'oop']], dtype=object)
