Why xgboost.cv and sklearn.cross_val_score give different results?


Problem description

I'm trying to make a classifier on a data set. I first used XGBoost:

import xgboost as xgb
import pandas as pd
import numpy as np

train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})

features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)

params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss"}
rounds = 180

result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)

The result is:

        test-logloss-mean  test-logloss-std  train-logloss-mean  
0             0.683539          0.000141            0.683407
179           0.622302          0.001504            0.606452  

We can see it is around 0.622.

But when I switch to sklearn using exactly the same parameters (I think), the result is quite different. Below is my code:

from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd

train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)

estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))

and the result is: [-4.11429976 -2.08675843 -3.27346662]; after flipping the sign it is still far from 0.622.

I set a breakpoint inside cross_val_score and saw that the classifier was making wild predictions, assigning every tuple in the test set to the negative class with about 0.99 probability.
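
For reference, a minimal way to inspect those out-of-fold probabilities without a debugger (this sketch reuses estimator, train_features and train_labels from the snippet above) is cross_val_predict:

from sklearn.model_selection import cross_val_predict

# Out-of-fold class probabilities, produced with the same default splitting
# that cross_val_score applies; column 1 is the predicted probability of class 1.
proba = cross_val_predict(estimator, train_features, train_labels, method="predict_proba")
print(proba[:10])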

I'm wondering where I have gone wrong. Could someone help me?

Answer

This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.

By default, cross_val_score uses KFold or StratifiedKFold with the shuffle argument set to False, so the folds are not drawn randomly from the data.
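
One way to confirm this default (a small sketch, assuming the train_labels from the question are in scope) is sklearn's check_cv helper:

from sklearn.model_selection import check_cv

# For a classifier with cv=None, check_cv resolves to the same splitter
# that cross_val_score builds internally.
cv = check_cv(cv=None, y=train_labels, classifier=True)
print(cv)  # StratifiedKFold(n_splits=..., shuffle=False)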

So if you do this, you should get the same results:

from sklearn.model_selection import StratifiedKFold

cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss",
    cv=StratifiedKFold(shuffle=True, random_state=23333))

Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
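
For completeness, a minimal sketch of the matched setup, reusing params, data_dmat, rounds, estimator, train_features and train_labels from the question, with one shared seed on both sides:

import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 23333  # one seed value shared by both libraries

# xgboost side: seed controls how xgb.cv shuffles its folds
xgb_result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds,
                    early_stopping_rounds=50, as_pandas=True, seed=SEED)

# sklearn side: the same value seeds the shuffled StratifiedKFold
sk_scores = cross_val_score(estimator, X=train_features, y=train_labels,
                            scoring="neg_log_loss",
                            cv=StratifiedKFold(shuffle=True, random_state=SEED))

# logloss reported by each library, for comparison
print(xgb_result["test-logloss-mean"].iloc[-1], -sk_scores.mean())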

