XgBoost : The least populated class in y has only 1 members, which is too few


Problem description

I'm using the Xgboost implementation on sklearn for a kaggle competition. However, I'm getting this 'warning' message:

$ python Script1.py
/home/sky/private/virtualenv15.0.1dev/myVE/local/lib/python2.7/site-packages/sklearn/cross_validation.py:516:
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.
% (min_labels, self.n_folds)), Warning)

According to another question on stackoverflow: "Check that you have at least 3 samples per class to be able to do StratifiedKFold cross validation with k == 3 (I think this is the default CV used by GridSearchCV for classification)."

And indeed, I don't have at least 3 samples per class.
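To confirm that this is the problem, one can count the samples per class before fitting; a minimal sketch (the label list `y` here is made-up data, not taken from the question):

```python
from collections import Counter

# Made-up labels standing in for the real target column.
y = ['a', 'a', 'a', 'b', 'b', 'c']

counts = Counter(y)
n_folds = 3

# Any class with fewer than n_folds members will trigger the warning.
too_small = [cls for cls, n in counts.items() if n < n_folds]
print(too_small)  # classes that break StratifiedKFold with k == 3
```

Here classes 'b' and 'c' would be reported, since they have fewer than 3 members.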

So my questions are:

a) What are my alternatives?

b) Why can't I use cross-validation?

c) What can I use instead?

...
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=3000,
        max_depth=15,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='multi:softmax',
        nthread=42,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test1,
    scoring='roc_auc',
    n_jobs=42,
    iid=False,
    cv=None,
    verbose=1)
...

grid_search.fit(train_x, place_id)
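With `cv=None`, GridSearchCV in this sklearn version falls back to 3-fold stratified CV for classifiers, which is where the warning originates. The requirement can be reproduced directly with StratifiedKFold on synthetic data (a sketch; the modern `sklearn.model_selection` import path is assumed, and newer versions phrase the warning slightly differently):

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(7).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 2])  # class 2 has only one member

skf = StratifiedKFold(n_splits=3)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    splits = list(skf.split(X, y))  # splits are still produced, with a warning

print(caught[0].message)  # complains about the least populated class
```

Note that StratifiedKFold only warns (rather than raising) as long as at least one class has enough members, but the folds it produces cannot stratify the single-member class properly.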

References:

One-shot learning with scikit-learn

Using a support vector classifier with a polynomial kernel in scikit-learn

Accepted answer

If you have a target/class with only one sample, that's too few for any model. What you can do is get another dataset, preferably as balanced as possible, since most models behave better on balanced sets.

If you cannot get another dataset, you will have to work with what you have. I would suggest you remove the sample that has the lonely target, so you will have a model which does not cover that target. If that does not fit your requirements, you need a new dataset.
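In practice, removing the lonely target amounts to keeping only the classes with at least `n_folds` members; a sketch with pandas (the frame contents are invented, though `place_id` is the target name used in the question):

```python
import pandas as pd

# Invented data; 'place_id' is the target column name from the question.
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'place_id': ['a', 'a', 'a', 'b', 'b', 'c'],
})

n_folds = 3
counts = df['place_id'].value_counts()
keep = counts[counts >= n_folds].index  # classes StratifiedKFold can handle
df_filtered = df[df['place_id'].isin(keep)]

print(df_filtered['place_id'].tolist())  # only class 'a' survives
```

The filtered frame can then be passed to `grid_search.fit` without triggering the warning, at the cost of a model that never predicts the dropped classes.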
