Fine Tuning hyperparameters doesn't improve score of classifiers


Problem description

I am experiencing a problem where fine-tuning the hyperparameters using GridSearchCV doesn't really improve my classifiers. I figured the improvement should be bigger than that. The biggest improvement for a classifier I've gotten with my current code is around ±0.03. I have a dataset with eight columns and an unbalanced binary outcome. For scoring I use F1, and I use KFold with 10 splits. I was hoping someone could spot something that is off and that I should look at. Thank you!

I use the following code:

import numpy as np
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

model_parameters = {
    "GaussianNB": {
    },
    "DecisionTreeClassifier": {
        'min_samples_leaf': range(5, 9),
        'max_depth': [None, 1, 2, 3, 4]  # max_depth must be > 0, so 0 is dropped
    },
    "KNeighborsClassifier": {
        'n_neighbors': range(1, 10),
        'weights': ["distance", "uniform"]
    },
    "SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0.5, 15, 30)  # C must be strictly positive, so start above 0
    },
    "LogisticRegression": {
        'C': np.linspace(0.5, 15, 30),  # C must be strictly positive
        'penalty': ["l1", "l2", "elasticnet", "none"],
        'solver': ["saga"],  # the default lbfgs solver supports only l2/none
        'l1_ratio': [0.5]    # required by elasticnet (only warns for the other penalties)
    }
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
n_splits = 10
scoring_method = make_scorer(f1_score, average="micro")
cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)

for model_name, parameters in model_parameters.items():

    # models is a dict mapping each name above to one of the 5 classifiers
    model = models[model_name]
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1,
                               scoring=scoring_method, verbose=False).fit(X_train, y_train)

    cvScore = cross_val_score(grid_search.best_estimator_, X_test, y_test,
                              cv=cv, scoring='f1').mean()
    classDict[model_name] = cvScore

Answer

If your classes are unbalanced, then when you do KFold you should keep the proportion between the two targets.

Having unbalanced folds can lead to very poor results.

Check StratifiedKFold, the stratified k-fold cross-validator:

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
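A minimal sketch of swapping it into the question's setup (the random_state value here is illustrative):

from sklearn.model_selection import StratifiedKFold

# Drop-in replacement for the KFold object in the question's code:
# every fold preserves the class proportions of y, so no fold ends up
# with a badly skewed (or even missing) minority class.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# GridSearchCV and cross_val_score accept it exactly like KFold, e.g.
# GridSearchCV(model, parameters, cv=cv, scoring=scoring_method)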

There are also a lot of techniques to handle an unbalanced dataset. Depending on the context:

  • upsampling the minority class (for example with resample from sklearn; see the sketch after this list)
  • undersampling the majority class (libraries such as imbalanced-learn have useful tools for both under- and over-sampling)
  • handling the imbalance within your specific ML model
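A rough sketch of the first bullet, assuming X_train and y_train from the question are NumPy arrays and class 1 is the minority:

import numpy as np
from sklearn.utils import resample

# Split the training data by class (assumes class 1 is the minority)
minority = y_train == 1
X_min, y_min = X_train[minority], y_train[minority]
X_maj, y_maj = X_train[~minority], y_train[~minority]

# Draw minority samples with replacement until both classes are the same size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

# Recombine into a balanced training set; only ever resample the train split,
# never the test split, or duplicated rows leak into the evaluation
X_train_bal = np.vstack([X_maj, X_min_up])
y_train_bal = np.concatenate([y_maj, y_min_up])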

For example, in SVC there is an argument you can pass when you create the model, class_weight='balanced':

from sklearn.svm import SVC

clf_3 = SVC(kernel='linear',
            class_weight='balanced',  # penalize errors on the minority class more
            probability=True)

which will penalize errors on the minority class more heavily.
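For reference, scikit-learn derives the 'balanced' weights as n_samples / (n_classes * np.bincount(y)), so on a hypothetical 90/10 target:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # hypothetical 90/10 unbalanced target
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)  # [0.556 5.0] -> errors on class 1 cost about nine times more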

You can change the configuration like this:

"SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0, 15, 30),
        'class_weight': 'balanced'

    }

For LogisticRegression you can instead set the weights explicitly, reflecting the proportion of your classes:

LogisticRegression(class_weight={0:1, 1:10}) # if problem is a binary one

changing the grid-search dict in this way:

"LogisticRegression": {
        'C': np.linspace(0, 15, 30),
        'penalty': ["l1", "l2", "elasticnet", "none"],
        'class_weight':{0:1, 1:10}
    }

In any case, the approach depends on the model used. For a neural network, for example, you can change the loss function to penalize the minority class with a weighted calculation (the same idea as in the logistic regression example).
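For instance, a minimal Keras sketch (assuming TensorFlow is installed; the layer sizes and the {0: 1, 1: 10} weights are illustrative, mirroring the logistic-regression example above):

from tensorflow import keras

# Tiny binary classifier for the question's 8-feature dataset
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# class_weight scales each sample's contribution to the loss by its class,
# so here an error on the minority class (1) costs ten times as much
model.fit(X_train, y_train, epochs=20, class_weight={0: 1, 1: 10})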
