XGBoost script is not outputting binary predictions properly


Question

I'm learning to use xgboost, and I have read through the documentation. However, I don't understand why the output of my script falls between 0 and 2. At first I thought it should be either 0 or 1, since this is a binary classification. Then I read that the output is a probability, but some outputs are 1.5 or higher (at least in the CSV), which doesn't make sense to me.

I'm unsure whether the problem is in the xgboost parameters or in the CSV creation. In particular, I'm not sure that the call np.expm1(preds) should be np.expm1, but I don't know what I should change it to.

In conclusion, my question is:

Why is the output not 0 or 1, but values like 0.0xxx and 1.xxx?

Here is my script:

import numpy as np
import xgboost as xgb
import pandas as pd

train = pd.read_csv('../dataset/train.csv')
train = train.drop('ID', axis=1)

y = train['TARGET']

train = train.drop('TARGET', axis=1)
x = train

dtrain = xgb.DMatrix(x.as_matrix(), label=y.tolist())

test = pd.read_csv('../dataset/test.csv')

test = test.drop('ID', axis=1)
dtest = xgb.DMatrix(test.as_matrix())


# XGBoost params:
def get_params():
    #
    params = {}
    params["objective"] = "binary:logistic"
    params["booster"] = "gbtree"
    params["eval_metric"] = "auc"
    params["eta"] = 0.3  #
    params["subsample"] = 0.50
    params["colsample_bytree"] = 1.0
    params["max_depth"] = 20
    params["nthread"] = 4
    plst = list(params.items())
    #
    return plst


bst = xgb.train(get_params(), dtrain, 1000)

preds = bst.predict(dtest)

print np.max(preds)
print np.min(preds)
print np.average(preds)

# Make Submission
test_aux = pd.read_csv('../dataset/test.csv')
result = pd.DataFrame({"Id": test_aux["ID"], 'TARGET': np.expm1(preds)})

result.to_csv("xgboost_submission.csv", index=False)

Recommended answer

When you run an xgb model with the objective binary:logistic, the predictions are probabilities rather than class labels. Each value is the chance that the sample belongs to a given class (for binary:logistic, the positive class, 1).
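To connect this with the question's script: probabilities lie between 0 and 1, but the submission step there applies np.expm1 (which computes exp(x) - 1) to them, so the values written to the CSV can reach about 1.718. A minimal sketch, assuming a few made-up probabilities in place of the real output of bst.predict(dtest):

```python
import numpy as np

# Hypothetical probabilities standing in for bst.predict(dtest);
# these values are made up for illustration only.
preds = np.array([0.0, 0.05, 0.5, 0.95, 1.0])

# np.expm1(x) computes exp(x) - 1, so inputs in [0, 1]
# are stretched onto [0, e - 1].
transformed = np.expm1(preds)

print("raw range:        ", preds.min(), preds.max())
print("after np.expm1:   ", transformed.min(), transformed.max())
```

If that transformation is the cause, writing preds to the CSV directly (without np.expm1) would keep the values in the 0 to 1 range.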

Let's say you have 3 classes [A, B, C]. An output for a sample like [0.2, 0.6, 0.4] indicates that this sample most probably belongs to class B.

If you only want the most probable class, take the index of the maximum element in that probability array, for example with the numpy function argmax.
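As a rough sketch of turning probabilities into class labels: with a single probability per sample, a threshold can be applied (0.5 here is an assumed cut-off, not something the answer prescribes), and with one row of per-class scores per sample, np.argmax picks the class. The first row reuses the [0.2, 0.6, 0.4] example above; all other values are made up.

```python
import numpy as np

# Binary case: one probability per sample (made-up values).
binary_probs = np.array([0.03, 0.72, 0.49, 0.91])
threshold = 0.5  # assumed cut-off; tune it for your problem
binary_labels = (binary_probs > threshold).astype(int)

# Multiclass case: one row of per-class scores per sample.
classes = np.array(["A", "B", "C"])
class_probs = np.array([
    [0.2, 0.6, 0.4],  # the example row from the answer
    [0.7, 0.2, 0.1],  # hypothetical second sample
])
# Index of the largest score in each row, mapped to its class name.
predicted = classes[np.argmax(class_probs, axis=1)]

print(binary_labels)
print(predicted)
```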

You can find more information in the xgb package's parameter documentation.
