无法为多标签分类器进行堆叠 [英] Unable to do Stacking for a Multi-label classifier
问题描述
我正在研究多标签文本分类问题(目标标签总数90).数据分布具有长尾和类不平衡以及大约10万条记录的情况.我正在使用OAA策略(反对所有人).我正在尝试使用Stacking创建一个合奏.
I am working on a multi-label text classification problem (Total target labels 90). The data distribution has a long tail and class imbalance and around 100k records. I am using the OAA strategy (One against all). I am trying to create an ensemble using Stacking.
文本功能: HashingVectorizer
(功能数量2 ** 20,字符分析器)
TSVD 降低维度(n_components=200).
Text features : HashingVectorizer
(number of features 2**20, char analyzer)
TSVD to reduce the dimensionality (n_components=200).
text_pipeline = Pipeline([
('hashing_vectorizer', HashingVectorizer(n_features=2**20,
analyzer='char')),
('svd', TruncatedSVD(algorithm='randomized',
n_components=200, random_state=19204))])
feat_pipeline = FeatureUnion([('text', text_pipeline)])
estimators_list = [('ExtraTrees',
OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
class_weight="balanced",
random_state=4621))),
('linearSVC',
OneVsRestClassifier(LinearSVC(class_weight='balanced')))]
estimators_ensemble = StackingClassifier(estimators=estimators_list,
final_estimator=OneVsRestClassifier(
LogisticRegression(solver='lbfgs',
max_iter=300)))
classifier_pipeline = Pipeline([
('features', feat_pipeline),
('clf', estimators_ensemble)])
错误
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
3 print(f"Execution time {time.time()-start}")
4
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
795 return np.ravel(y)
796
--> 797 raise ValueError("bad input shape {0}".format(shape))
798
799
ValueError: bad input shape (89792, 83)
推荐答案
StackingClassifier
目前不支持多标签分类.您可以通过查看 fit
参数的形状值来了解这些功能,例如
StackingClassifier
does not support multi label classification as of now. You could get to understand these functionalities by looking at the shape value for the fit
parameters such as here.
解决方案是将OneVsRestClassifier包装器放在 StackingClassifier
的顶部,而不是放在各个模型上.
Solution would be to put the OneVsRestClassifier wrapper on top of StackingClassifier
rather on the individual models.
示例:
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.33,
random_state=42)
estimators_list = [('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
class_weight="balanced",
random_state=4621)),
('linearSVC', LinearSVC(class_weight='balanced'))]
estimators_ensemble = StackingClassifier(estimators=estimators_list,
final_estimator = LogisticRegression(solver='lbfgs', max_iter=300))
ovr_model = OneVsRestClassifier(estimators_ensemble)
ovr_model.fit(X_train, y_train)
ovr_model.score(X_test, y_test)
# 0.45454545454545453
这篇关于无法为多标签分类器进行堆叠的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!