Pandas: Handle Unseen Data In Test
Question
I have a training dataset and am building some machine learning models. I don't have access to the test set, and I want to handle the possibility that a value of one of the categorical features in test was never observed in train.
Here's a toy example illustrating what I mean:
I have a DataFrame, old, like this:
import pandas as pd
old = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
It looks like this:
     car
0   Audi
1    BMW
2  Mazda
I now one-hot encode it like this:
new = pd.get_dummies(old)
This returns:
   car_Audi  car_BMW  car_Mazda
0         1        0          0
1         0        1          0
2         0        0          1
This is all good. However, suppose I encounter a row in test that looks like this:
        car
0  Mercedes
I can one-hot encode it, but I'll end up with a column (car_Mercedes) that I didn't have in train.
Is there a way in Pandas to just ignore values in test that I haven't seen in train?
So the desired output for my Mercedes row would be:
   car_Audi  car_BMW  car_Mazda
0         0        0          0
Thanks!
Recommended answer
You can do this with reindex, aligning the test dummies to the columns produced from train:
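import pandas as pd
old = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
new = pd.get_dummies(old)
test = pd.DataFrame({"car": ["Audi", "BMW", "Mazda", "Mercedes"]})
# Keep only the training columns: car_Mercedes is dropped, and fill_value=0
# fills any training column absent from test with 0 instead of NaN.
pd.get_dummies(test).reindex(columns=new.columns, fill_value=0)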
Out[460]:
   car_Audi  car_BMW  car_Mazda
0         1        0          0
1         0        1          0
2         0        0          1
3         0        0          0   # row 3 (Mercedes) is all zeros after reindexing
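For the exact case in the question, where the test frame contains only the unseen value, here is a minimal sketch of two ways to get the all-zero row. The names train, test, train_dummies, encoded, car_dtype and encoded_cat are illustrative and not from the original post. The second option, fixing the categories with pd.CategoricalDtype, is an alternative to the reindex approach above rather than part of the original answer.

```python
import pandas as pd

train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
test = pd.DataFrame({"car": ["Mercedes"]})  # contains only an unseen value

train_dummies = pd.get_dummies(train)

# Option 1: reindex to the training columns. car_Mercedes is dropped and
# fill_value=0 fills the training columns that test did not produce
# (without fill_value they would be NaN).
encoded = pd.get_dummies(test).reindex(columns=train_dummies.columns, fill_value=0)

# Option 2 (alternative technique): declare the training categories up front.
# Values outside those categories become NaN when cast, and get_dummies
# encodes NaN as an all-zero row when dummy_na is left at its default.
car_dtype = pd.CategoricalDtype(categories=train["car"].unique())
encoded_cat = pd.get_dummies(test.astype({"car": car_dtype}))

print(encoded)
print(encoded_cat)
```

Both frames should end up with the columns car_Audi, car_BMW and car_Mazda and a single row of zeros. Note that in pandas 2.0 and later get_dummies returns boolean columns by default, so the output may show True/False; passing dtype=int to get_dummies gives the 0/1 form shown above.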