How to handle missing NaNs for machine learning in Python


Question

How do I handle missing values in a dataset before applying a machine learning algorithm?

I have noticed that simply dropping missing NaN values is not a smart choice. I usually impute them (compute the mean) using pandas and fill in the data, which kind of works and improves classification accuracy, but it may not be the best thing to do.
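The mean fill described above can be sketched in pandas roughly as follows. The frame and its values are made up for illustration; only the column names are borrowed from the dataset in the question.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the question.
df = pd.DataFrame({
    "left_eye_center_x": [66.0, 64.0, np.nan, 65.0],
    "nose_tip_x": [44.0, 48.0, 47.0, 45.0],
})

# Replace each NaN with the mean of its own column.
# df.mean() skips NaN by default, so the mean uses observed values only.
filled = df.fillna(df.mean(numeric_only=True))
```

Here the single gap in left_eye_center_x is filled with the mean of the three observed values in that column.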

Here is the important question: what is the best way to handle missing values in a data set?

For example, in this dataset most columns have original data for only about 30% of the rows.

Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x            7039 non-null float64
left_eye_center_y            7039 non-null float64
right_eye_center_x           7036 non-null float64
right_eye_center_y           7036 non-null float64
left_eye_inner_corner_x      2271 non-null float64
left_eye_inner_corner_y      2271 non-null float64
left_eye_outer_corner_x      2267 non-null float64
left_eye_outer_corner_y      2267 non-null float64
right_eye_inner_corner_x     2268 non-null float64
right_eye_inner_corner_y     2268 non-null float64
right_eye_outer_corner_x     2268 non-null float64
right_eye_outer_corner_y     2268 non-null float64
left_eyebrow_inner_end_x     2270 non-null float64
left_eyebrow_inner_end_y     2270 non-null float64
left_eyebrow_outer_end_x     2225 non-null float64
left_eyebrow_outer_end_y     2225 non-null float64
right_eyebrow_inner_end_x    2270 non-null float64
right_eyebrow_inner_end_y    2270 non-null float64
right_eyebrow_outer_end_x    2236 non-null float64
right_eyebrow_outer_end_y    2236 non-null float64
nose_tip_x                   7049 non-null float64
nose_tip_y                   7049 non-null float64
mouth_left_corner_x          2269 non-null float64
mouth_left_corner_y          2269 non-null float64
mouth_right_corner_x         2270 non-null float64
mouth_right_corner_y         2270 non-null float64
mouth_center_top_lip_x       2275 non-null float64
mouth_center_top_lip_y       2275 non-null float64
mouth_center_bottom_lip_x    7016 non-null float64
mouth_center_bottom_lip_y    7016 non-null float64
Image                        7049 non-null object

Recommended answer

What is the best way to handle missing values in data set?

There is NO best way. Each solution/algorithm has its own pros and cons. You can even mix some of them to create your own strategy and tune the related parameters to find the one that best fits your data; there is a lot of research on this topic.

For example, Mean Imputation is quick and simple, but it underestimates the variance and distorts the shape of the distribution, because every NaN is replaced with the same mean value. KNN Imputation might not be ideal for a large data set in terms of time complexity, since it iterates over all the data points and performs a calculation for each NaN value; it also assumes that the attribute containing NaNs is correlated with the other attributes.
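The variance claim can be checked with a small simulation. This is only a sketch on synthetic data: the normal distribution, the seed, and the 70% missing rate are assumptions chosen to resemble the question's dataset, not values taken from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature (assumed distribution, for illustration only).
x = rng.normal(loc=50.0, scale=10.0, size=1000)

# Hide roughly 70% of the values at random.
missing = rng.random(x.size) < 0.7
observed = x[~missing]

# Mean imputation: every hidden value becomes the observed mean.
imputed = x.copy()
imputed[missing] = observed.mean()

var_observed = observed.var()
var_imputed = imputed.var()
print(var_observed, var_imputed)
```

Because every filled value sits exactly at the mean, the variance of the imputed column shrinks by the fraction of values that were actually observed.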

How to handle missing values in datasets before applying machine learning algorithm??

In addition to the mean imputation you mention, you could also take a look at K-Nearest Neighbor Imputation and Regression Imputation, and refer to the Imputer class in scikit-learn (replaced by SimpleImputer in sklearn.impute in recent releases) to see which existing APIs you can use.
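As a minimal sketch of the scikit-learn route: the Imputer class named in the answer is no longer available in recent scikit-learn releases, so this example assumes a version that provides SimpleImputer in sklearn.impute. The array values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with hypothetical values; NaN marks a missing entry.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# strategy="mean" fills each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
```

Fitting the imputer on the training split and calling transform on the test split keeps the test data from influencing the fill values.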

KNN Imputation

Calculate the mean of the k nearest neighbors of the point that has the NaN.
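One way to try this, assuming a scikit-learn version that ships KNNImputer in sklearn.impute, is sketched below. The data and the choice of n_neighbors=2 are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with hypothetical values; the second row is missing its second feature.
X = np.array([
    [1.0, 1.0],
    [2.0, np.nan],
    [3.0, 3.0],
    [10.0, 10.0],
])

# Distances are computed on the features both rows have observed,
# and the NaN is filled with the mean of that feature over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The rows closest to [2.0, NaN] on the first feature are [1.0, 1.0] and [3.0, 3.0], so the gap is filled with the mean of their second feature; the distant row [10.0, 10.0] does not contribute.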

Regression Imputation

A regression model is fitted to predict the observed values of a variable from the other variables, and that model is then used to impute values in the cases where that variable is missing.
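A minimal sketch of this idea, assuming a single predictor and a plain linear model, is shown below. The column names x and y and their values are hypothetical, and LinearRegression is only one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame with hypothetical values: y is missing in two rows, x is complete.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan],
})

known = df["y"].notna()

# Fit the regression on rows where y is observed...
model = LinearRegression().fit(df.loc[known, ["x"]], df.loc[known, "y"])

# ...then predict y for the rows where it is missing.
df.loc[~known, "y"] = model.predict(df.loc[~known, ["x"]])
```

With several incomplete columns the same step would have to be repeated per column, which is the part that dedicated imputation tools automate.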

See the 'Imputation of missing values' section of the scikit-learn documentation. I have also heard of the Orange library for imputation, but haven't had a chance to use it yet.
