Keep the same dummy variables in training and testing data

Question

I am building a prediction model in Python with separate training and testing sets. The training data contains categorical variables stored as numbers, e.g., zip code [91521, 23151, 12355, ...], as well as categorical variables stored as strings, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].

To train the model, I first use 'pd.get_dummies' to create dummy variables for these variables, and then fit the model on the transformed training data.

I apply the same transformation to my test data and predict with the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345'. The reason is that the test data produces fewer dummy variables, because it contains fewer distinct 'city' and 'zipcode' values.
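The mismatch can be reproduced on a small scale. The following is a minimal sketch, assuming pandas is installed; the toy frames `train_raw` and `test_raw` are hypothetical stand-ins for the real data, not the data from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training and testing sets.
train_raw = pd.DataFrame({
    "zipcode": [91521, 23151, 12355],
    "city": ["Chicago", "New York", "Los Angeles"],
})
test_raw = pd.DataFrame({
    "zipcode": [91521, 23151],
    "city": ["Chicago", "Boston"],
})

# Encode each set independently, as described in the question.
# Passing `columns=` makes get_dummies treat the numeric zip codes as
# categories too; by default only object/category columns are encoded.
train = pd.get_dummies(train_raw, columns=["zipcode", "city"])
test = pd.get_dummies(test_raw, columns=["zipcode", "city"])

# The two encoded frames end up with different numbers of columns.
print(train.shape[1], test.shape[1])

# Columns the model was trained on that the test set lacks.
print(sorted(set(train.columns) - set(test.columns)))

# Columns that exist only in the test set (categories unseen in training).
print(sorted(set(test.columns) - set(train.columns)))
```

Each set only gets dummy columns for the categories it actually contains, so the column sets differ in both directions: some training columns are missing from the test set, and the test set has a column the model has never seen.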

How can I solve this problem? For example, 'OneHotEncoder' only encodes numerical categorical variables, and 'DictVectorizer()' only encodes string categorical variables. I searched online and found a few similar questions, but none of them really addresses mine.

Handling categorical features with scikit-learn

https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do

https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python

Recommended answer

You can also just find the missing columns and add them to the test dataset:

# train and test are the DataFrames returned by pd.get_dummies
# Columns present in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set, filled with 0
for c in missing_cols:
    test[c] = 0
# Keep only the training columns, in the same order as in the training set
test = test[train.columns]

This code also ensures that columns produced by categories that appear in the test dataset but not in the training dataset are removed, because the final selection keeps only the training columns.
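The same alignment can be written as a single call to `DataFrame.reindex`. This is a sketch of an equivalent approach rather than the answer's own code; the helper name `align_to_train` and the toy data are hypothetical.

```python
import pandas as pd


def align_to_train(test_dummies, train_columns):
    """Return test_dummies with exactly the columns in train_columns, in that order.

    Columns missing from the test set are added and filled with 0;
    columns that exist only in the test set are dropped.
    """
    return test_dummies.reindex(columns=train_columns, fill_value=0)


# Hypothetical toy data; dtype=int keeps every dummy column as integers.
train = pd.get_dummies(
    pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]}), dtype=int
)
test = pd.get_dummies(
    pd.DataFrame({"city": ["Chicago", "Boston"]}), dtype=int
)

aligned = align_to_train(test, train.columns)
print(list(aligned.columns))
print(aligned)
```

Unlike the loop, `reindex` returns a new DataFrame and leaves `test` unchanged, which can be convenient when the unaligned frame is still needed. Note that rows whose category was never seen in training (here 'Boston') end up with all zeros in that group of dummy columns.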
