XGBOOST feature name error - Python
Question
This question has probably been asked many times in different forms. However, my problem is that when I use XGBClassifier() with production-like data, I get a feature name mismatch error. I am hoping someone can tell me what I am doing wrong. Here is my code. BTW, the data is completely made up:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
import xgboost as xgb
data = {"Age":[44,27,30,38,40,35,70,48,50,37],
"BMI":["25-29","35-39","30-35","40-45","45-49","20-25","<19",">70","50-55","55-59"],
"BP":["<140/90",">140/90",">140/90",">140/90","<140/90","<140/90","<140/90",">140/90",">140/90","<140/90"],
"Risk":["No","Yes","Yes","Yes","No","No","No","Yes","Yes","No"]}
df = pd.DataFrame(data)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
labelencoder = LabelEncoder()
def encoder_X(columns):
    for i in columns:
        X.iloc[:, i] = labelencoder.fit_transform(X.iloc[:, i])
encoder_X([1,2])
y = labelencoder.fit_transform(y)
onehotencdoer = OneHotEncoder(categorical_features = [[1,2]])
X = onehotencdoer.fit_transform(X).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
model = xgb.XGBClassifier()
model.fit(X_train, y_train, verbose = True)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {0}%".format(accuracy*100))
So far so good, no error. The accuracy score is 100%, but that is because the data set is made up, so I am not worried about that.
When I try to classify a new dataset with the model, I get a "feature name mismatch" error:
proddata = {"Age":[65,50,37],
"BMI":["25-29","35-39","30-35"],
"BP":["<140/90",">140/90",">140/90"]}
prod_df = pd.DataFrame(proddata)
def encoder_prod(columns):
    for i in columns:
        prod_df.iloc[:, i] = labelencoder.fit_transform(prod_df.iloc[:, i])
encoder_prod([1,2])
onehotencdoer = OneHotEncoder(categorical_features = [[1,2]])
prod_df = onehotencdoer.fit_transform(prod_df).toarray()
predictions = model.predict(prod_df)
After this I get the following error:
predictions = model.predict(prod_df)
Traceback (most recent call last):
File "<ipython-input-24-456b5626e711>", line 1, in <module>
predictions = model.predict(prod_df)
File "c:\users\sozdemir\appdata\local\programs\python\python35\lib\site-packages\xgboost\sklearn.py", line 526, in predict
ntree_limit=ntree_limit)
File "c:\users\sozdemir\appdata\local\programs\python\python35\lib\site-packages\xgboost\core.py", line 1044, in predict
self._validate_features(data)
File "c:\users\sozdemir\appdata\local\programs\python\python35\lib\site-packages\xgboost\core.py", line 1288, in _validate_features
data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
expected f6, f11, f12, f9, f7, f8, f10 in input data
I think this happens as a result of one-hot encoding, when the encoder is fit and its output is converted to an array. I might be wrong though.
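To illustrate that suspicion, here is a minimal sketch using a toy one-hot encoder written in plain Python. ToyOneHot is a hypothetical helper, not scikit-learn's implementation; it only assumes that a one-hot encoder creates one output column per category seen when it is fit. It uses the BMI values from the data above.

```python
# Toy one-hot encoder (illustrative only, not scikit-learn's implementation).
class ToyOneHot:
    def fit(self, values):
        # The output width is fixed here: one column per distinct category.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        index = {c: i for i, c in enumerate(self.categories_)}
        rows = []
        for v in values:
            row = [0] * len(self.categories_)
            row[index[v]] = 1
            rows.append(row)
        return rows


train_bmi = ["25-29", "35-39", "30-35", "40-45", "45-49",
             "20-25", "<19", ">70", "50-55", "55-59"]
prod_bmi = ["25-29", "35-39", "30-35"]

fitted_on_train = ToyOneHot().fit(train_bmi)
refit_on_prod = ToyOneHot().fit(prod_bmi)  # what refitting on prod data does

train_width = len(fitted_on_train.transform(train_bmi)[0])
wrong_width = len(refit_on_prod.transform(prod_bmi)[0])
right_width = len(fitted_on_train.transform(prod_bmi)[0])
print(train_width, wrong_width, right_width)
```

The encoder refit on the three production rows produces 3 BMI columns instead of 10. Together with the 2 BP columns and Age, that gives 6 features instead of 13, which matches the two feature lists in the error message.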
If this is a result of one-hot encoding, can I just skip OneHotEncoder, since LabelEncoder() already encodes the categorical values?
Thank you so much for any help and feedback.
PS: The version of XGBOOST is 0.7
xgboost.__version__
Out[37]: '0.7'
Recommended answer
It seems the encoder needs to be saved after it has been fitted. I used joblib from sklearn. Jason from https://machinelearningmastery.com/ gave me the idea of saving the encoder. Below is an edited version:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.externals import joblib
import xgboost as xgb
data = {"Age":[44,27,30,38,40,35,70,48,50,37],
"BMI":["25-29","35-39","30-35","40-45","45-49","20-25","<19",">70","50-55","55-59"],
"BP":["<140/90",">140/90",">140/90",">140/90","<140/90","<140/90","<140/90",">140/90",">140/90","<140/90"],
"Risk":["No","Yes","Yes","Yes","No","No","No","Yes","Yes","No"]}
df = pd.DataFrame(data)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
labelencoder = LabelEncoder()
def encoder_X(columns):
    for i in columns:
        X.iloc[:, i] = labelencoder.fit_transform(X.iloc[:, i])
encoder_X([1,2])
y = labelencoder.fit_transform(y)
onehotencdoer = OneHotEncoder(categorical_features = [[1,2]])
onehotencdoer.fit(X)
joblib.dump(onehotencdoer, "encoder.pkl") # save the fitted encoder (dump returns file names, not the encoder)
X = onehotencdoer.transform(X).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
model = xgb.XGBClassifier()
model.fit(X_train, y_train, verbose = True)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {0}%".format(accuracy*100))
And now we can use the fitted encoder to transform the prod data:
proddata = {"Age":[65,50,37],
"BMI":["25-29","35-39","30-35"],
"BP":["<140/90",">140/90",">140/90"]}
prod_df = pd.DataFrame(proddata)
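def encoder_prod(columns):
    for i in columns:
        # note: this refits LabelEncoder on the prod data, so the integer
        # codes can differ from the ones assigned during training
        prod_df.iloc[:, i] = labelencoder.fit_transform(prod_df.iloc[:, i])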
encoder_prod([1,2])
enc = joblib.load("encoder.pkl")
prod_df = enc.transform(prod_df).toarray()
predictions = model.predict(prod_df)
results = [round(val) for val in predictions]
It seems to be working for this example, and I'll try this method at work on a larger data set. Please let me know what you think.
Thanks
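One possible refinement, sketched under assumptions rather than taken from the answer above: with scikit-learn 0.20 or newer, OneHotEncoder can encode string columns directly, so the LabelEncoder step (which is refit on the prod data and may assign different integer codes there) can be dropped. The sketch wraps the encoder in a ColumnTransformer, fits it once on the training frame, saves it with joblib, and only calls transform on the prod frame. The temporary file path and the variable names are illustrative, and the standalone joblib package is assumed to be installed.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, 70, 48, 50, 37],
    "BMI": ["25-29", "35-39", "30-35", "40-45", "45-49",
            "20-25", "<19", ">70", "50-55", "55-59"],
    "BP": ["<140/90", ">140/90", ">140/90", ">140/90", "<140/90",
           "<140/90", "<140/90", ">140/90", ">140/90", "<140/90"],
})
prod_df = pd.DataFrame({
    "Age": [65, 50, 37],
    "BMI": ["25-29", "35-39", "30-35"],
    "BP": ["<140/90", ">140/90", ">140/90"],
})

# One-hot encode the string columns directly; pass Age through unchanged.
# handle_unknown="ignore" encodes categories unseen in training as all zeros.
# sparse_threshold=0 makes the transformer always return a dense array.
encoder = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["BMI", "BP"])],
    remainder="passthrough",
    sparse_threshold=0,
)

X_train = encoder.fit_transform(train_df)  # fit once, on training data only

# Persist the fitted encoder (illustrative path) and reload it later.
path = os.path.join(tempfile.mkdtemp(), "encoder.pkl")
joblib.dump(encoder, path)
loaded = joblib.load(path)

X_prod = loaded.transform(prod_df)  # transform only, never refit
print(X_train.shape, X_prod.shape)
```

Both arrays have 13 columns here, so X_train could be passed to model.fit and X_prod to model.predict without a feature count mismatch.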