无法为多标签分类器进行堆叠 [英] Unable to do Stacking for a Multi-label classifier

查看:144
本文介绍了无法为多标签分类器进行堆叠的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在研究多标签文本分类问题(目标标签总数90).数据分布具有长尾和类不平衡以及大约10万条记录的情况.我正在使用OAA策略(反对所有人).我正在尝试使用Stacking创建一个合奏.

I am working on a multi-label text classification problem (Total target labels 90). The data distribution has a long tail and class imbalance and around 100k records. I am using the OAA strategy (One against all). I am trying to create an ensemble using Stacking.

文本功能: HashingVectorizer (功能数量2 ** 20,字符分析器)
TSVD 降低维度(n_components=200).

Text features : HashingVectorizer(number of features 2**20, char analyzer)
TSVD to reduce the dimensionality (n_components=200).

text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=200, random_state=19204))])

feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [('ExtraTrees',
                    OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
                                                             class_weight="balanced",
                                                             random_state=4621))),
                   ('linearSVC',
                    OneVsRestClassifier(LinearSVC(class_weight='balanced')))]
estimators_ensemble = StackingClassifier(estimators=estimators_list,
                                         final_estimator=OneVsRestClassifier(
                                             LogisticRegression(solver='lbfgs',
                                                                max_iter=300)))

classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble)])

错误

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
      1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
      3 print(f"Execution time {time.time()-start}")
      4 

3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796 
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798 
    799 

ValueError: bad input shape (89792, 83)

推荐答案

StackingClassifier 目前不支持多标签分类.您可以通过查看 fit 参数的形状值来了解这些功能,例如

StackingClassifier does not support multi label classification as of now. You could get to understand these functionalities by looking at the shape value for the fit parameters such as here.

解决方案是将OneVsRestClassifier包装器放在 StackingClassifier 的顶部,而不是放在各个模型上.

Solution would be to put the OneVsRestClassifier wrapper on top of StackingClassifier rather on the individual models.

示例:

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33,
                                                    random_state=42)

estimators_list = [('ExtraTrees', ExtraTreesClassifier(n_estimators=30, 
                                                       class_weight="balanced", 
                                                       random_state=4621)),
                   ('linearSVC', LinearSVC(class_weight='balanced'))]

estimators_ensemble = StackingClassifier(estimators=estimators_list,
                                         final_estimator = LogisticRegression(solver='lbfgs', max_iter=300))

ovr_model = OneVsRestClassifier(estimators_ensemble)

ovr_model.fit(X_train, y_train)
ovr_model.score(X_test, y_test)

# 0.45454545454545453

这篇关于无法为多标签分类器进行堆叠的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆