How to handle missing NaNs for machine learning in Python
Problem description
How should missing values in a dataset be handled before applying a machine learning algorithm?

I noticed that it is not a smart thing to simply drop the missing NaN values. I usually interpolate (for example, fill with the column mean) using pandas, which sort of works and improves the classification accuracy, but may not be the best thing to do.
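The mean-fill approach described here can be sketched in a few lines of pandas; the column names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# toy frame with missing entries (hypothetical columns)
df = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                   "y": [4.0, 5.0, np.nan]})

# replace each NaN with its column's mean
df_filled = df.fillna(df.mean())
```

`df.mean()` skips NaNs by default, so each column is filled with the mean of its observed values.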
Here is a very important question: what is the best way to handle missing values in a data set?
For example, if you look at this dataset, only about 30% of the rows have the original data:
Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x 7039 non-null float64
left_eye_center_y 7039 non-null float64
right_eye_center_x 7036 non-null float64
right_eye_center_y 7036 non-null float64
left_eye_inner_corner_x 2271 non-null float64
left_eye_inner_corner_y 2271 non-null float64
left_eye_outer_corner_x 2267 non-null float64
left_eye_outer_corner_y 2267 non-null float64
right_eye_inner_corner_x 2268 non-null float64
right_eye_inner_corner_y 2268 non-null float64
right_eye_outer_corner_x 2268 non-null float64
right_eye_outer_corner_y 2268 non-null float64
left_eyebrow_inner_end_x 2270 non-null float64
left_eyebrow_inner_end_y 2270 non-null float64
left_eyebrow_outer_end_x 2225 non-null float64
left_eyebrow_outer_end_y 2225 non-null float64
right_eyebrow_inner_end_x 2270 non-null float64
right_eyebrow_inner_end_y 2270 non-null float64
right_eyebrow_outer_end_x 2236 non-null float64
right_eyebrow_outer_end_y 2236 non-null float64
nose_tip_x 7049 non-null float64
nose_tip_y 7049 non-null float64
mouth_left_corner_x 2269 non-null float64
mouth_left_corner_y 2269 non-null float64
mouth_right_corner_x 2270 non-null float64
mouth_right_corner_y 2270 non-null float64
mouth_center_top_lip_x 2275 non-null float64
mouth_center_top_lip_y 2275 non-null float64
mouth_center_bottom_lip_x 7016 non-null float64
mouth_center_bottom_lip_y 7016 non-null float64
Image 7049 non-null object
Recommended answer
What is the best way to handle missing values in a data set?
There is no single best way; each solution/algorithm has its own pros and cons (you can even mix several of them together to create your own strategy, and tune the related parameters to come up with the one that best fits your data; there is a lot of research on this topic).
For example, mean imputation is quick and simple, but it underestimates the variance, and the distribution shape is distorted by replacing each NaN with the mean value. KNN imputation, on the other hand, may not be ideal for a large data set in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value; it also assumes that the attribute containing NaNs is correlated with the other attributes.
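The variance shrinkage from mean imputation is easy to demonstrate on a toy series (the numbers below are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# pandas skips NaNs, so s.var() is the sample variance of the observed values
filled = s.fillna(s.mean())

# filling with the mean adds points exactly at the center of the data,
# so the sample variance of the filled series can only go down
```

The mean itself is unchanged by the fill, but the spread is understated.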
How should missing values in a dataset be handled before applying a machine learning algorithm?
In addition to the mean imputation you mention, you could also take a look at K-Nearest Neighbor imputation and regression imputation, and refer to the Imputer class in scikit-learn to check the existing APIs you can use.
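Note that in current scikit-learn versions the old `sklearn.preprocessing.Imputer` class has been replaced by `SimpleImputer`; a minimal usage sketch with made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# learn per-column means on fit, substitute them for NaNs on transform
imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)
```

The fit/transform split matters in practice: fit the imputer on the training data only, then transform both training and test sets with the same learned means.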
KNN Imputation

Replace each NaN with the mean of the k nearest neighbors of that point.
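A minimal sketch of that idea using only NumPy (a toy implementation that assumes at least k fully observed rows exist; real libraries handle ties, scaling, and edge cases far more carefully):

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean, over the k nearest complete rows,
    of the missing feature; distance uses the observed features only."""
    X = np.asarray(X, dtype=float).copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Euclidean distance to each complete row on the observed features
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)
    return X
```

With k=1 this degenerates to copying the missing value from the single nearest complete row.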
Regression Imputation

A regression model is estimated to predict observed values of a variable based on other variables; that model is then used to impute values in cases where that variable is missing.
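As a sketch of the idea with plain NumPy least squares (toy numbers, a single predictor; a real setup would regress on all the other variables):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, np.nan, 8.2])   # y is roughly 2*x, one value missing

obs = ~np.isnan(y)

# fit y ~ a*x + b on the rows where y is observed
A = np.column_stack([x[obs], np.ones(obs.sum())])
(a, b), *_ = np.linalg.lstsq(A, y[obs], rcond=None)

# use the fitted model to fill in the missing y
y_imputed = y.copy()
y_imputed[~obs] = a * x[~obs] + b
```

One known caveat of plain regression imputation: the imputed values fall exactly on the fitted line, which (like mean imputation) understates the residual variance unless noise is added back.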
Here is a link to scikit-learn's 'Imputation of missing values' section. I have also heard of the Orange library for imputation, but have not had a chance to use it yet.