Python sklearn Multilabel Classification : UserWarning: Label not 226 is present in all training examples

Question

I am trying out a multilabel classification problem. My data looks like this:

DocID   Content             Tags
1       some text here...   [70]
2       some text here...   [59]
3       some text here...   [183]
4       some text here...   [173]
5       some text here...   [71]
6       some text here...   [98]
7       some text here...   [211]
8       some text here...   [188]
.       .............       .....
.       .............       .....
.       .............       .....

This is my code:

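import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split lived in sklearn.cross_validation before scikit-learn 0.18
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

traindf = pd.read_csv("mul.csv")
print("This is what our training data looks like:")
print(traindf)

t = TfidfVectorizer()

X = traindf["Content"]

# MultiLabelBinarizer expects one iterable of labels per row
y = traindf["Tags"]

print("Original Content")
print(X)
# Turn the raw text into a sparse TF-IDF matrix
X = t.fit_transform(X)
print("Content After transformation")
print(X)
print("Original Tags")
print(y)
# Turn the tag lists into a binary indicator matrix, one column per label
y = MultiLabelBinarizer().fit_transform(y)
print("Tags After transformation")
print(y)

print("Features extracted:")
# Newer scikit-learn releases name this method get_feature_names_out()
print(t.get_feature_names())
print("Scores of features extracted")
idf = t.idf_
print(dict(zip(t.get_feature_names(), idf)))

print("Splitting into training and validation sets...")
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)

print("Training Set Content and Tags")
print(Xtrain)
print(ytrain)
print("Validation Set Content and Tags")
print(Xvalidate)
print(yvalidate)

print("Creating classifier")
# One binary logistic regression is fitted per label column
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))

clf.fit(Xtrain, ytrain)

predictions = clf.predict(Xvalidate)
print("Predicted Tags are:")
print(predictions)
print("Correct Tags on Validation Set are :")
print(yvalidate)
# For multilabel targets, score() reports subset accuracy (every label must match)
print("Accuracy on validation set: %.3f" % clf.score(Xvalidate, yvalidate))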

The code runs fine, but I keep getting these messages:

X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 288 is present in all training examples.
  str(classes[c]))
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 304 is present in all training examples.
  str(classes[c]))
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 340 is present in all training examples.

What does this mean? Does it show that my data is not diverse enough?

Recommended answer

Some data mining algorithms have problems when some items are present in all or many records. This is, for example, an issue when doing association rule mining with the Apriori algorithm.

Whether it is a problem or not depends on the classifier. I don't know the particular classifier you're using, but here is an example of when it could matter: fitting a decision tree with a maximum depth.

Say you are fitting a decision tree with a maximum depth, using Hunt's algorithm and the Gini index to determine the best split (see here for an explanation, slide 35 onwards). A first split could be on whether or not the record has label 288. If every record has this label, the Gini index will be optimal for such a split. This means that the first several splits will be useless, because you're not actually splitting the training set: you're splitting it into an empty set (records without 288) and the set itself (records with 288). So the first several levels of the tree are useless. If you then set a maximum depth, this could result in a low-accuracy decision tree.
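To make "not actually splitting the training set" concrete, here is a minimal pure-Python sketch of such a degenerate split. The helper names `gini` and `weighted_gini` and the toy `targets` list are hypothetical and not taken from any library or from the question's data; the sketch only shows that when every record falls on the same side, one child is empty and the weighted Gini impurity after the split equals the impurity before it.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = float(len(left) + len(right))
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical target classes for six records (illustration only).
targets = ["a", "a", "b", "b", "b", "c"]
# Every record has the attribute we split on, as in the scenario above.
has_attribute = [True] * len(targets)

with_attr = [t for t, h in zip(targets, has_attribute) if h]
without_attr = [t for t, h in zip(targets, has_attribute) if not h]

parent = gini(targets)
children = weighted_gini(without_attr, with_attr)

print(len(without_attr), len(with_attr))  # sizes of the two children
print(parent - children)                  # impurity reduction achieved by the split
```

The split produces children of size 0 and 6, and the impurity reduction is zero, so a tree that spends levels on splits like this uses up its depth budget without separating any records.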

In any case, the warning you get is not a problem with your code; at most it is a problem with your data set. You should check whether the classifier you're using is sensitive to this kind of thing. If it is, it may give better results when you filter out the labels that occur everywhere.
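One way to check which labels are affected, and to filter them out, is sketched below. It rests on an assumption about scikit-learn's behaviour that the answer does not state. Judging from the library source, OneVsRestClassifier appears to raise this warning for any label column that holds a single value across the training rows. The "Label not N" wording then seems to correspond to a column of all zeros, meaning no training row carries label N. The helper `constant_label_columns` and the toy `classes` and `ytrain` values are hypothetical stand-ins for `MultiLabelBinarizer().classes_` and the real indicator matrix.

```python
def constant_label_columns(y_rows, classes):
    """Return {label: value} for label columns that hold one value in every row."""
    constant = {}
    # zip(*y_rows) walks the matrix column by column; it also works on 2-D NumPy arrays.
    for label, column in zip(classes, zip(*y_rows)):
        values = set(int(v) for v in column)
        if len(values) == 1:
            constant[label] = values.pop()
    return constant


# Toy indicator matrix standing in for ytrain (rows = documents, columns = labels).
classes = [59, 70, 288]  # hypothetical label ids
ytrain = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]

constant = constant_label_columns(ytrain, classes)
print(constant)  # {288: 0}: label 288 is absent from every training row

# Column indices of the labels that actually vary in the training split.
keep = [i for i, label in enumerate(classes) if label not in constant]
print(keep)  # [0, 1]
```

With the NumPy arrays from the question, the same index list could be applied as `ytrain[:, keep]` and `yvalidate[:, keep]` before fitting. Whether dropping those labels is appropriate depends on the task, since a label removed this way can never be predicted.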
