Pandas: Handle Unseen Data In Test

Question

I have a training dataset and am building some machine learning models. I don't have access to the test set, and I want to handle the possibility that a value of one of the categorical features in test was never observed in train.

Here's a toy example illustrating what I mean:

I have a DataFrame, old, like this:

import pandas as pd

old = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})

It looks like this:

    car
0   Audi
1   BMW
2   Mazda

I now one-hot encode it like this:

new = pd.get_dummies(old)

which returns:

   car_Audi  car_BMW  car_Mazda
0         1        0          0
1         0        1          0
2         0        0          1

This is all good. However, suppose I encounter a row in test that looks like this:

    car
0   Mercedes

I can one-hot encode it, but I'll end up with a column (car_Mercedes) that I didn't have in train.
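
To make the mismatch concrete, here is a minimal sketch. The names train, test, train_cols and test_cols are illustrative and not part of the original post. It encodes the two frames separately and compares the resulting column names:

```python
import pandas as pd

# Illustrative frames mirroring the toy example above.
train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
test = pd.DataFrame({"car": ["Mercedes"]})

# Encoding each frame on its own yields different column sets.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)

print(train_cols)  # ['car_Audi', 'car_BMW', 'car_Mazda']
print(test_cols)   # ['car_Mercedes']
```

A model fitted on the three train columns has no way to consume car_Mercedes, which is why the test encoding has to be aligned to the train columns.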

Is there a way in Pandas to just ignore values in test that I haven't seen in train?

So the desired output for my Mercedes row would be:

   car_Audi  car_BMW  car_Mazda
0         0        0          0

Thanks!

Recommended answer

You can achieve this with reindex:

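import pandas as pd

old = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
new = pd.get_dummies(old)  # columns seen in train
test = pd.DataFrame({"car": ["Audi", "BMW", "Mazda", "Mercedes"]})
# Align test to the train columns: the unseen car_Mercedes column is dropped,
# and fill_value=0 fills any train column that is missing from test.
# Note: pandas >= 2.0 prints True/False instead of 1/0 unless dtype=int is passed.
pd.get_dummies(test).reindex(columns=new.columns, fill_value=0)
Out[460]:
   car_Audi  car_BMW  car_Mazda
0         1        0          0
1         0        1          0
2         0        0          1
3         0        0          0  # row 3 (Mercedes) is all zeros after reindexing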
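
Since the real test set is not available while training, it may help to wrap the reindex step in a small function that can be applied to any later batch. The following is only a sketch under that assumption. The helper name encode_like_train and the variables train, train_columns and encoded are hypothetical and do not come from the original answer. It passes dtype=int so the result is 0/1 on any pandas version, and fill_value=0 so a train category that is absent from the batch becomes a zero column rather than NaN:

```python
import pandas as pd

def encode_like_train(df, train_columns):
    """One-hot encode df and align it to the columns seen in train."""
    dummies = pd.get_dummies(df, dtype=int)
    # Columns for unseen categories are dropped; train categories that
    # do not occur in df are added as all-zero columns.
    return dummies.reindex(columns=train_columns, fill_value=0)

train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
train_columns = pd.get_dummies(train, dtype=int).columns

# "Mercedes" was never seen in train; "Audi" and "Mazda" are absent here.
test = pd.DataFrame({"car": ["BMW", "Mercedes"]})
encoded = encode_like_train(test, train_columns)
print(encoded)
#    car_Audi  car_BMW  car_Mazda
# 0         0        1          0
# 1         0        0          0
```

Only train_columns has to be kept from training time (for example, saved alongside the model), so the same alignment can be applied to any later test batch.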
