Why am I getting a "ValueError: feature_names mismatch" when specifying the feature-name list in XGBoost for visualization?

Problem description

When I pass the feature names while constructing the DMatrix (the internal data structure used by XGBoost), I get this error:

d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X))
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X))
...
...
...
shap_values = shap.TreeExplainer(model).shap_values(X_train)
shap.summary_plot(shap_values, X_train)

ValueError                                Traceback (most recent call last)
<ipython-input-59-4635c450279d> in <module>()
----> 1 shap_values = shap.TreeExplainer(model).shap_values(X_train)
      2 shap.summary_plot(shap_values, X_train)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\explainers\tree.py in shap_values(self, X, **kwargs)
    104             if not str(type(X)).endswith("xgboost.core.DMatrix'>"):
    105                 X = xgboost.DMatrix(X)
--> 106             phi = self.trees.predict(X, pred_contribs=True)
    107         elif self.model_type == "lightgbm":
    108             phi = self.trees.predict(X, pred_contrib=True)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs)
   1042             option_mask |= 0x08
   1043 
-> 1044         self._validate_features(data)
   1045 
   1046         length = c_bst_ulong()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in _validate_features(self, data)
   1286 
   1287                 raise ValueError(msg.format(self.feature_names,
-> 1288                                             data.feature_names))
   1289 
   1290     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['Serial No', 'gender', 'Date', 'Product_Type', 'Product_Type', ... ... , 'Last_feature'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39']
<names of some features at column number corresponding to feature number in the following list> in input data
training data did not have the following fields: f7, f31, f33, f11, f6, f26, f2, f5, f17, f4, f37, f9, f1, f0, f39, f14, f12, f23, f13, f15, f22, f19, f35, f24, f38, f8, f28, f25, f20, f34, f27, f32, f36, f29, f16, f3, f21, f18, f30, f10

When I don't specify the feature names while defining the DMatrix, I get no errors and the summary plot is produced.

But I need the names of the features to appear in the plot instead of Feature 2, Feature 15, etc. Why is this error occurring, and how do I fix it?

In case you want it, here's the full code. It is basically my attempt to replicate the visualizations in this link, but for my own dataset and with correspondingly customized training parameters:

from sklearn.model_selection import train_test_split
import xgboost
import shap
import xlrd
import numpy as np
import matplotlib.pylab as pl

# print the JS visualization code to the notebook
shap.initjs()

import pandas as pd
data = pd.read_csv('InputCEM_FS_out.csv')
X = data.loc[:, data.columns != 'Score'] 
y = data['Score']
y = y/max(y)

# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)


# Some of the values are floats or integers and some are objects, so the object columns need to be encoded:
from sklearn import preprocessing 
for f in X_train.columns: 
    if X_train[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder() 
        lbl.fit(list(X_train[f].values)) 
        X_train[f] = lbl.transform(list(X_train[f].values))

for f in X_test.columns: 
    if X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder() 
        lbl.fit(list(X_test[f].values)) 
        X_test[f] = lbl.transform(list(X_test[f].values))

X_train.fillna((-999), inplace=True) 
X_test.fillna((-999), inplace=True)

X_train=np.array(X_train) 
X_test=np.array(X_test) 
X_train = X_train.astype(float) 
X_test = X_test.astype(float)

d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.

params = [
    ('max_depth', 3),
    ('eta', 0.025),
    ('objective', 'binary:logistic'),
    ('min_child_weight', 4),
    ('silent', 1),
    ('eval_metric', 'auc'),
    ('subsample', 0.75),
    ('colsample_bytree', 0.75),
    ('gamma', 0.75),
]

model = xgboost.train(params, d_train, 5000, evals = [(d_test, "test")], verbose_eval=100, early_stopping_rounds=20)

shap_values = shap.TreeExplainer(model).shap_values(X_train) # This line is what gives the error if the feature names are specified
shap.summary_plot(shap_values, X_train)

Recommended answer

As the error shows, the model was trained with your real feature names, but the data passed at prediction time carries the default names (f0, f1, ...). The cause is this line:

shap_values = shap.TreeExplainer(model).shap_values(X_train)

You pass X_train, which at this point is just a NumPy array without column names, so shap wraps it in a new DMatrix whose columns get the default names (f0, f1, and so on). Instead, pass a DataFrame with the desired columns (and pass that same DataFrame to shap.summary_plot so the plot labels use those names):

shap_values = shap.TreeExplainer(model).shap_values(pd.DataFrame(X_train,columns = X.columns))
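
To make the mismatch concrete, here is a minimal sketch of the kind of name comparison that fails in the traceback above. It is a simplified stand-in written for illustration, not XGBoost's actual code: the helpers default_feature_names and check_feature_names are hypothetical, and the three feature names are just the first few from the error message.

```python
def default_feature_names(n_columns):
    """Names a nameless array ends up with (f0, f1, ...), as seen in the traceback."""
    return [f"f{i}" for i in range(n_columns)]


def check_feature_names(trained_names, incoming_names):
    """Hypothetical stand-in for the name comparison done before predicting."""
    if list(trained_names) != list(incoming_names):
        unexpected = sorted(set(incoming_names) - set(trained_names))
        raise ValueError(
            "feature_names mismatch; training data did not have: "
            + ", ".join(unexpected)
        )


# Names the model was trained with (passed via feature_names=...).
trained = ["Serial No", "gender", "Date"]

# A bare NumPy array carries no names, so defaults are used and the check fails.
try:
    check_feature_names(trained, default_feature_names(len(trained)))
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

# Data that carries the original column names passes the check without error.
check_feature_names(trained, trained)
```

The same reasoning explains why the error disappears when feature_names is omitted from the DMatrix: both sides then use the default names, so they match, but the plot can only show generic labels.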
