How can I align pandas get_dummies across training / validation / testing?


Question

I have 3 sets of data (training, validation and testing) and when I run:

    training_x = pd.get_dummies(training_x, columns=['a', 'b', 'c'])

This gives me a certain number of features. But when I run it on the validation data I get a different number of features, and the same happens with the testing data. Is there any way to normalize (wrong word, I know) across all data sets so that the number of features aligns?

Recommended answer

Dummies should be created before dividing the dataset into train, test or validation sets.

Suppose I have train and test dataframes as follows:

    import pandas as pd

    train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['A'])
    test = pd.DataFrame([7, 8], columns=['A'])

    # creating dummies for train
    # (dtype=int gives 1/0; recent pandas versions default to True/False)
    pd.get_dummies(train, columns=['A'], dtype=int)

    # output
       A_1  A_2  A_3  A_4  A_5  A_6
    0    1    0    0    0    0    0
    1    0    1    0    0    0    0
    2    0    0    1    0    0    0
    3    0    0    0    1    0    0
    4    0    0    0    0    1    0
    5    0    0    0    0    0    1

    # creating dummies for test data
    pd.get_dummies(test, columns=['A'], dtype=int)

    # output
       A_7  A_8
    0    1    0
    1    0    1

So the dummies for categories 7 and 8 are present only in test, and the two encoded frames end up with different features.
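The mismatch can be checked directly, and it can also be patched after the fact. The following is a minimal sketch, not part of the original answer, assuming the training columns are treated as the reference schema. The names `train_d`, `test_d`, `mismatch` and `test_aligned` are illustrative, and the test frame has one extra row (3) added so that at least one row matches a training category.

```python
import pandas as pd

train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['A'])
test = pd.DataFrame([7, 8, 3], columns=['A'])  # 3 added for illustration

train_d = pd.get_dummies(train, columns=['A'], dtype=int)
test_d = pd.get_dummies(test, columns=['A'], dtype=int)

# Columns that appear in only one of the two encoded frames
mismatch = sorted(set(train_d.columns) ^ set(test_d.columns))
print(mismatch)

# Reindex test to the training columns: dummies unseen in training
# (A_7, A_8) are dropped, and training dummies missing from test are
# added and filled with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```

With this approach, rows whose category never occurred in training become all-zero rows, which may or may not be acceptable depending on the model.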

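Combining the frames and encoding them once avoids this:

    final_df = pd.concat([train, test], ignore_index=True)

    # pass columns= explicitly: without it, get_dummies only encodes
    # object, string and category columns and would skip the integer column A
    dummy_created = pd.get_dummies(final_df, columns=['A'])

    # now you can split it into train and test
    # (this draws a new random split; it does not restore the original one)
    from sklearn.model_selection import train_test_split
    train_x, test_x = train_test_split(dummy_created, test_size=0.33)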

Now train and test will have the same set of features.
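Because `train_test_split` draws new random partitions, it does not help when the training, validation and testing sets already exist and must be kept as they are. One possible variation, sketched here under that assumption and not taken from the original answer, is to label each frame with the `keys` argument of `pd.concat`, encode once, and then select the original partitions back out. The key labels `'train'` and `'test'` and the variable names are illustrative.

```python
import pandas as pd

train = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['A'])
test = pd.DataFrame([7, 8], columns=['A'])

# Label each frame so its rows can be separated again after encoding;
# keys= adds an outer index level holding 'train' or 'test'
combined = pd.concat([train, test], keys=['train', 'test'])
encoded = pd.get_dummies(combined, columns=['A'], dtype=int)

# Select by the outer index level to recover the original partitions
train_x = encoded.xs('train')
test_x = encoded.xs('test')

print(train_x.shape, test_x.shape)
```

A validation frame could be added to the list with its own key in the same way. Note that encoding the combined data lets categories from the test set shape the feature space, which is worth weighing against the reindexing alternative when strict separation between sets matters.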
