Choosing between imputation methods


Question

I'm trying to evaluate two methods for imputing missing data.
My dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

My target label is LotFrontage.
First I encoded all categorical features with one-hot encoding, then used the correlation matrix and filtered for anything with correlation above 0.3 or below -0.3.

import pandas as pd

# train.csv is the training file from the Kaggle competition linked above
train_df = pd.read_csv('train.csv')

encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
                                                       'LotShape', 'LandContour', 'Utilities',
                                                       'LotConfig', 'LandSlope', 'Neighborhood',
                                                       'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])

# inspect correlations stronger than +/-0.3 (numeric_only skips the
# categorical columns that were not one-hot encoded above)
corrmat = encoded_df.corr(numeric_only=True)
corrmat[(corrmat > 0.3) | (corrmat < -0.3)]
# filtering out based on corrmat output...
encoded_df = encoded_df[['SalePrice', 'MSSubClass', 'LotFrontage', 'LotArea',
                         'BldgType_1Fam', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
                         'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM']]
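
As a side note, the hardcoded column list can also be derived from corrmat programmatically; a minimal sketch that would replace the manual selection above (not meant to run after it):

# keep LotFrontage plus every column whose absolute correlation
# with LotFrontage exceeds 0.3
target_corr = corrmat['LotFrontage'].drop('LotFrontage')
keep = target_corr[target_corr.abs() > 0.3].index.tolist()
encoded_df = encoded_df[['LotFrontage'] + keep]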

Then I tried two imputation methods:

  1. Use the mean value of LotFrontage (I chose this method because I saw a low outlier ratio).
  2. Predict LotFrontage with a DecisionTreeRegressor.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# impute LotFrontage with the mean value (we saw a low outlier ratio, so we use this)
encoded_df1 = encoded_df.copy()
encoded_df1['LotFrontage'] = encoded_df1['LotFrontage'].fillna(encoded_df1['LotFrontage'].mean())
X1 = encoded_df1.drop('LotFrontage', axis=1)
y1 = encoded_df1['LotFrontage']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1)
classifier1 = DecisionTreeRegressor()
classifier1.fit(X1_train, y1_train)
y1_pred = classifier1.predict(X1_test)
print('score1: ', classifier1.score(X1_test, y1_test))

# impute LotFrontage by predicting it with a DecisionTreeRegressor
encoded_df2 = encoded_df.copy()
# train only on the rows where LotFrontage is present
X2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1)
y2 = encoded_df2[~encoded_df2['LotFrontage'].isnull()]['LotFrontage']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2)
classifier2 = DecisionTreeRegressor()
classifier2.fit(X2_train, y2_train)
# predict LotFrontage for the rows where it is missing
y2_pred = classifier2.predict(encoded_df2[encoded_df2['LotFrontage'].isnull()].drop('LotFrontage', axis=1))
imputed_encoded_df2 = encoded_df2[encoded_df2['LotFrontage'].isnull()].assign(LotFrontage=y2_pred)
# score the model on the rows whose LotFrontage values are themselves predictions
X3 = imputed_encoded_df2.drop('LotFrontage', axis=1)
y3 = imputed_encoded_df2['LotFrontage']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3)
classifier2.fit(X3_train, y3_train)
y3_pred = classifier2.predict(X3_test)
print('score2: ', classifier2.score(X3_test, y3_test))

My questions are:

  1. Is it correct to first use fillna with the mean value and only then split into train and test and check the score? If I fill in the values before fitting the model, won't it fit the model on the imputed data and thus give me a biased result? The same applies to the second method. (A sketch of the split-first alternative I have in mind follows this list.)
  2. Is there anything else I'm doing wrong? I can't determine the best imputation method, since I get bad and seemingly random scores for both methods.
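
For reference, a minimal sketch of the split-first alternative from question 1, using scikit-learn's SimpleImputer (the mean is learned from the training rows only):

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# split first, then fit the imputer on the training rows only,
# so no statistic computed from the test rows leaks into training
train_part, test_part = train_test_split(encoded_df, random_state=0)
train_part, test_part = train_part.copy(), test_part.copy()

imputer = SimpleImputer(strategy='mean')
train_part['LotFrontage'] = imputer.fit_transform(train_part[['LotFrontage']]).ravel()
test_part['LotFrontage'] = imputer.transform(test_part[['LotFrontage']]).ravel()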

Answer

1. Imputation Using (Mean/Median) Values:

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently of the others. It can only be used with numeric data.

Pros:
Easy and fast.
Works well with small numerical datasets.

Cons:
Doesn't factor in the correlations between features.
It only works at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn't account for the uncertainty in the imputations.
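
A minimal sketch of this strategy using scikit-learn's SimpleImputer (the column name comes from the dataset above):

import numpy as np
from sklearn.impute import SimpleImputer

# each column's mean is computed from its non-missing values and used
# to fill that column's NaNs, independently of the other columns
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train_df[['LotFrontage']] = mean_imputer.fit_transform(train_df[['LotFrontage']])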

2. Imputation Using (Most Frequent) or (Zero/Constant) Values:

Most frequent is another statistical strategy to impute missing values, and yes, it works with categorical features (strings or numerical representations): it replaces missing data with the most frequent value within each column.

Pros:
Works well with categorical features.

Cons:
It also doesn't factor in the correlations between features.
It can introduce bias into the data.

Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
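
Both variants are available through scikit-learn's SimpleImputer as well; a minimal sketch (Alley and MiscFeature are just example columns from this dataset that contain missing values):

from sklearn.impute import SimpleImputer

# most frequent: works on string columns as well as numeric ones
mode_imputer = SimpleImputer(strategy='most_frequent')
train_df[['Alley']] = mode_imputer.fit_transform(train_df[['Alley']])

# constant: every NaN becomes the value passed as fill_value
const_imputer = SimpleImputer(strategy='constant', fill_value='None')
train_df[['MiscFeature']] = const_imputer.fit_transform(train_df[['MiscFeature']])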

3. Imputation Using k-NN:

The k nearest neighbours algorithm is used for simple classification. The algorithm uses ‘feature similarity’ to predict the values of any new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for predicting missing values: find the k closest neighbours of an observation with missing data, then impute it based on the non-missing values in the neighbourhood.

How does it work?
It creates a basic mean impute, then uses the resulting complete list to construct a KDTree. It then uses that KDTree to compute the nearest neighbours (NN). After it finds the k-NNs, it takes their weighted average.

Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset).

Cons:
Computationally expensive: k-NN works by storing the whole training dataset in memory.
k-NN is quite sensitive to outliers in the data (unlike SVM).

Since the outlier ratio is low, we can use method 3. It will also have less impact on the correlation between the imputed target variable (i.e. LotFrontage) and the other features.

import sys
from impyute.imputation.cs import fast_knn

sys.setrecursionlimit(100000)  # increase Python's recursion limit for the KDTree construction

# start the KNN training; fast_knn expects a numpy array and returns the full
# imputed array, so convert with .values and keep only the LotFrontage column
imputed = fast_knn(train_df[['LotFrontage', '1stFlrSF', 'MSSubClass']].values, k=30)
train_df['LotFrontage'] = imputed[:, 0]

I've chosen the two features considering their correlation with the LotFrontage column.
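
If impyute isn't available, scikit-learn's KNNImputer implements the same idea (neighbours found with a NaN-aware Euclidean distance, missing entries averaged over the k neighbours); a roughly equivalent sketch:

from sklearn.impute import KNNImputer

# impute LotFrontage from the 30 nearest neighbours measured on the same three columns
knn_imputer = KNNImputer(n_neighbors=30)
cols = ['LotFrontage', '1stFlrSF', 'MSSubClass']
train_df[cols] = knn_imputer.fit_transform(train_df[cols])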

