Handling different Factor Levels in Train and Test data


Problem description

I have a training data set with 20 columns, all of which are factors that I have to use to train a model. I have also been given a test data set on which I have to apply my model to make predictions and submit them.

While doing initial data exploration, I checked the levels in the training and test data out of curiosity, since all of the variables are categorical. To my dismay, most of the variables have different levels in the training and test data sets.

For example:

table(train$cap.shape) #training data column levels
  b    c    f    k    x 
196    4 2356  828 2300

table(test$cap.shape) #test data 

  b    f    s    x 
256  796   32 1356

Here the test data set has an extra category, s. How can I handle cases like this? The extra category c in the training set has very few instances, so I was thinking of merging that level with another level based on how it is distributed with respect to the dependent variable, but I am stuck on how to handle the extra level in the test set.
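
To make the merging idea concrete, here is a minimal sketch that reuses the cap.shape counts shown above. It is not the distribution-based merge described in the question: it swaps in a simpler rule that sends every level that is rare or absent in the training set to a catch-all level. The helper `collapse_levels`, the level name "other", and the threshold of 50 are all assumptions made for illustration.

```r
# Hypothetical helper: values not listed in `keep` are mapped to `other`,
# and the result always carries the same level set.
collapse_levels <- function(x, keep, other = "other") {
  x <- as.character(x)
  x[!is.na(x) & !(x %in% keep)] <- other
  factor(x, levels = c(keep, other))
}

# Rebuild vectors with the cap.shape counts from the tables above.
train_shape <- factor(rep(c("b", "c", "f", "k", "x"),
                          times = c(196, 4, 2356, 828, 2300)))
test_shape  <- factor(rep(c("b", "f", "s", "x"),
                          times = c(256, 796, 32, 1356)))

# Keep only levels seen at least 50 times in training (arbitrary threshold).
counts <- table(train_shape)
keep   <- names(counts)[as.vector(counts) >= 50]

train_shape2 <- collapse_levels(train_shape, keep)
test_shape2  <- collapse_levels(test_shape, keep)

# Both factors now share one level set; c (train) and s (test) are "other".
print(table(train_shape2))
print(table(test_shape2))
```

Note that "other" is only learnable here because the rare level c puts a few rows into it during training. Whether c and s really behave alike is a modelling judgment the code cannot make.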

More examples

table(train$odor) #train
  c    f    m    n    p    s    y 
 189 2155   36 2150    2  576  576

table(test$odor) #test

  a    c    f    l    n    p 
400    3    5  400 1378  254

In this column the test set has two extra levels (a and l), each with a substantial number of instances. How can I handle these discrepancies?

table(train$sColour) #train
    b    h    k    n    o    r    w    y 
   48 1627  700  753   48   72 2388   48

table(test$sColour) #test
    h    k    n    u 
    5 1172 1215   48

Here the test set has an extra level, u.

Should I first build a model on the training set alone, find the important predictors, and only then worry about the factor levels?
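
Rather than comparing `table()` output column by column, the mismatches can be listed in one pass. The sketch below assumes both sets are data frames with matching column names. The helper `level_diff` is hypothetical, and the toy data only mirrors the level sets shown above, not the real rows.

```r
# Toy data reproducing only the level sets from the tables above.
train <- data.frame(
  cap.shape = factor(c("b", "c", "f", "k", "x", "x", "f", "b")),
  sColour   = factor(c("b", "h", "k", "n", "o", "r", "w", "y"))
)
test <- data.frame(
  cap.shape = factor(c("b", "f", "s", "x")),
  sColour   = factor(c("h", "k", "n", "u"))
)

# Hypothetical helper: for each shared column, report the levels that
# occur in only one of the two data sets.
level_diff <- function(train, test) {
  cols <- intersect(names(train), names(test))
  out <- lapply(cols, function(col) {
    tr <- levels(factor(train[[col]]))  # factor() drops unused levels
    te <- levels(factor(test[[col]]))
    list(only_in_train = setdiff(tr, te),
         only_in_test  = setdiff(te, tr))
  })
  names(out) <- cols
  out
}

diffs <- level_diff(train, test)
print(diffs$cap.shape)
print(diffs$sColour)
```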

Solution

Having different feature sets violates a basic precept of machine learning: the training and test data must represent the same data space. These do not. Although each pair of columns shares a common kernel of features (dimensions), to use both sets with the same model you would have to either reduce each set to only the common features, or extend both to the union of the features, filling in "don't care" or semantically null values for the extra features.
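
As a rough illustration of the two options, here is a sketch using the odor level sets from the question. It assumes the columns are R factors. The column names `odor_u` and `odor_c` are made up for the example, and `NA` stands in for the "semantically null" value.

```r
# Toy data reproducing only the odor level sets from the question.
train <- data.frame(odor = factor(c("c", "f", "m", "n", "p", "s", "y")))
test  <- data.frame(odor = factor(c("a", "c", "f", "l", "n", "p")))

# Option 1: extend both sets to the union of the levels.
all_levels   <- union(levels(train$odor), levels(test$odor))
train$odor_u <- factor(train$odor, levels = all_levels)
test$odor_u  <- factor(test$odor,  levels = all_levels)

# Option 2: reduce both sets to the common levels; any value outside
# that set becomes NA and has to be dropped or imputed afterwards.
common       <- intersect(levels(train$odor), levels(test$odor))
train$odor_c <- factor(train$odor, levels = common)
test$odor_c  <- factor(test$odor,  levels = common)

print(all_levels)
print(common)
print(table(test$odor_c, useNA = "ifany"))
```

With the union, both factors are encoded identically, but levels with no training rows (a and l here) carry no learned signal. With the common set, those rows have to be handled as missing values.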
