XGBoost difference in train and test features after converting to DMatrix


Problem Description

Just wondering how the following case is possible:

 def fit(self, train, target):
     xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
     self.model = xgb.train(self.params, xgtrain, self.num_rounds)

I passed the train dataset as a csr_matrix with 5233 columns, and after converting to DMatrix I got 5322 features.

Later, on the predict step, I got an error caused by the bug above :(

 def predict(self, test):
     if not self.model:
         return -1
     xgtest = xgb.DMatrix(test)
     return self.model.predict(xgtest)


Error: ... training data did not have the following fields: f5232

How can I guarantee that my train/test datasets are converted to DMatrix correctly?

Is there any way to do something in Python similar to the following R code?

# get same columns for test/train sparse matrices
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
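
For reference, a rough Python equivalent of the R snippet above (a sketch assuming the data starts out as pandas DataFrames with named columns and a reasonably recent xgboost that accepts DataFrames directly; the frame and label names are illustrative):

import numpy as np
import pandas as pd
import xgboost as xgb

# keep only the columns shared by train and test, in the same order
col_order = X_train_df.columns.intersection(X_test_df.columns)
X_train_df = X_train_df[col_order]
X_test_df = X_test_df[col_order]

xgtrain = xgb.DMatrix(X_train_df, label=y_train, missing=np.nan)
xgtest = xgb.DMatrix(X_test_df, missing=np.nan)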

My approach doesn't work, unfortunately:

def _normalize_columns(self):
    columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
          (set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
    for item in columns:
        if item in self.xgtest.feature_names:
            self.xgtest.feature_names.remove(item)
        else:
            # it seems to be an immutable structure; new items can't be added!!!
            self.xgtest.feature_names.append(item) 
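
The DMatrix feature_names list cannot be edited in place like that. One workaround, assuming the test csr_matrix uses the same column positions as the training matrix and is only missing trailing columns (so the missing columns would be all zeros anyway, as with one-hot encoded levels absent from the test set), is to pad or trim the test matrix to the training width before building the DMatrix. A sketch under that assumption, with illustrative names:

import scipy.sparse as sp
import xgboost as xgb

def make_test_dmatrix(test_csr, n_train_features):
    # pad with empty (all-zero) columns, or drop extra ones, so the test
    # matrix has exactly as many columns as the training matrix
    n_rows, n_cols = test_csr.shape
    if n_cols < n_train_features:
        pad = sp.csr_matrix((n_rows, n_train_features - n_cols))
        test_csr = sp.hstack([test_csr, pad], format="csr")
    elif n_cols > n_train_features:
        test_csr = test_csr[:, :n_train_features]
    return xgb.DMatrix(test_csr)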

Recommended Answer

Another possibility is that a feature level exists exclusively in the training data and not in the test data. This mostly happens after one-hot encoding, which produces a big matrix with one column per level of each categorical feature. In your case it looks like "f5232" is exclusive to either the training or the test data. In either case, model scoring is likely to throw an error (in most implementations of ML packages) because:

  1. If it is exclusive to the training data: the model object will have a reference to this feature in the model formula, so scoring will throw an error saying it cannot find this column.
  2. If it is exclusive to the test data (less likely, since test data is usually smaller than training data): the model object will not have a reference to this feature in the model formula, so scoring will throw an error saying it got this column but the model formula does not have it. This is also less likely because most implementations are aware of this case.

Solutions:

  1. The best "automatic" solution is to keep only those columns that are common to both the training and the test data after one-hot encoding (see the sketch after this list).
  2. For ad-hoc analysis, if you cannot afford to drop a feature level because of its importance, do stratified sampling to make sure all levels of the feature end up in both the training and the test data.
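
A minimal sketch of option 1, assuming the one-hot encoding is done with pandas.get_dummies (the raw frames and the label variable below are illustrative):

import pandas as pd
import xgboost as xgb

train_ohe = pd.get_dummies(train_raw)   # hypothetical raw training frame
test_ohe = pd.get_dummies(test_raw)     # hypothetical raw test frame

# keep only the dummy columns present in both sets, in the same order
common = train_ohe.columns.intersection(test_ohe.columns)
xgtrain = xgb.DMatrix(train_ohe[common], label=target)
xgtest = xgb.DMatrix(test_ohe[common])

For option 2, scikit-learn's train_test_split with its stratify argument is one way to make sure every level of an important categorical feature ends up in both splits.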
