How can I align pandas get_dummies across training / validation / testing?
Problem description
I have 3 sets of data (training, validation and testing) and when I run:
training_x = pd.get_dummies(training_x, columns=['a', 'b', 'c'])
It gives me a certain number of features. But when I run it on the validation data, it gives me a different number, and the same happens for the test data. Is there any way to normalize (wrong word, I know) across all data sets so that the number of features aligns?
Recommended answer
Dummies should be created before dividing the dataset into training, validation, and test sets.
Suppose I have train and test dataframes as follows:
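import pandas as pd
# six categories in train, so the encoded frame has columns A_1 .. A_6
train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['A'])
test = pd.DataFrame([7, 8], columns=['A'])
# create dummies for the training data
# (pandas >= 2.0 prints True/False instead of 1/0 unless dtype=int is passed)
pd.get_dummies(train, columns=['A'])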
Output:
   A_1  A_2  A_3  A_4  A_5  A_6
0    1    0    0    0    0    0
1    0    1    0    0    0    0
2    0    0    1    0    0    0
3    0    0    0    1    0    0
4    0    0    0    0    1    0
5    0    0    0    0    0    1
# create dummies for the test data
pd.get_dummies(test, columns=['A'])
Output:
   A_7  A_8
0    1    0
1    0    1
So the dummies for categories 7 and 8 are present only in the test data, and the two encoded frames end up with different features.
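The mismatch can be checked directly by comparing the column sets of the two encoded frames. The following is a minimal sketch with toy frames like the ones in this answer; the variable names `only_in_train` and `only_in_test` are illustrative and not part of the original post.

```python
import pandas as pd

# Toy frames in the spirit of the example (values are illustrative).
train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=["A"])
test = pd.DataFrame([7, 8], columns=["A"])

# Encode each frame separately, as in the problematic workflow.
train_cols = set(pd.get_dummies(train, columns=["A"]).columns)
test_cols = set(pd.get_dummies(test, columns=["A"]).columns)

# Columns that exist on one side only are the source of the mismatch.
only_in_train = sorted(train_cols - test_cols)
only_in_test = sorted(test_cols - train_cols)

print(only_in_train)
print(only_in_test)
```

If both lists are empty, the two frames already share the same dummy columns and no alignment is needed.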
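# combine both frames, then encode once so every row shares the same columns
final_df = pd.concat([train, test], ignore_index=True)
# columns=['A'] is required here: numeric columns are not encoded by default
dummy_created = pd.get_dummies(final_df, columns=['A'])
# now you can split it into train and test
from sklearn.model_selection import train_test_split
train_x, test_x = train_test_split(dummy_created, test_size=0.33)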
Now train and test will have the same set of features.
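Note that `train_test_split` draws a new random split, so it does not give back the original training, validation, and test sets. If the three sets already exist, as in the question, one possible variation is to label each set while concatenating and then slice them back out after encoding. This is only a sketch under assumptions: the frame `valid`, the key labels, and the toy values are made up for illustration.

```python
import pandas as pd

# Three pre-existing splits (toy values, assumed for illustration).
train = pd.DataFrame({"A": [1, 2, 3]})
valid = pd.DataFrame({"A": [3, 4]})
test = pd.DataFrame({"A": [7, 8]})

# Stack the splits; `keys` adds an outer index level naming each split.
combined = pd.concat([train, valid, test], keys=["train", "valid", "test"])

# Encode once so that every split shares the same dummy columns.
encoded = pd.get_dummies(combined, columns=["A"])

# Slice the original splits back out by their key.
train_x = encoded.loc["train"]
valid_x = encoded.loc["valid"]
test_x = encoded.loc["test"]

print(list(train_x.columns))
print(train_x.shape, valid_x.shape, test_x.shape)
```

One caveat: encoding the combined data lets categories that occur only in validation or test shape the training columns. If that is a concern, another commonly used option is to encode the training set alone and call `DataFrame.reindex(columns=train_x.columns, fill_value=0)` on the other encoded sets.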