Missing Values in Data Analysis


Question

I have a data set in which the variable GENDER, with two levels Male (M) and Female (F), has a lot of missing values. How do I deal with missing values? What are the different methods to handle them? Any help would be appreciated.

Recommended answer

There are several techniques for estimating a missing value. I've been writing a paper on such methods for a project at Uni.
I will briefly explain 5 commonly used missing data imputation techniques. In what follows, we consider a dataset in which every row is a pattern (or observation) and every column is a feature (or attribute), and we want to "fix" a given pattern that has a missing value in its j-th feature (position).

  • Pattern removal.
    Remove a pattern from the dataset if it has at least one missing value.
    If many patterns have missing values, however, I would not suggest this approach, since the number of patterns in your dataset will decrease drastically and the training phase will not be adequate.
  • The mean/mode approach.
    If a pattern has a missing value in position j, take the mean (if the j-th attribute is continuous) or the mode (if the j-th attribute is categorical) of the j-th column and substitute it in the pattern's j-th position. Obviously, only the non-missing values of column j should enter the mean/mode computation. For a categorical variable such as GENDER, this means the mode.
  • The conditional mean/mode.
    If you have the labels (i.e. supervised learning), you can use the previous approach but compute the mean/mode only over the (non-missing) elements of column j that belong to patterns with the very same label as the pattern you're trying to fix. This refines the previous method because values from patterns of a different class are not considered.
  • Hot-decking.
    Given a certain dissimilarity metric, measure the dissimilarity between the pattern you want to fix and all the other patterns that have a value in the attribute to be imputed (the j-th attribute in our case). Take the j-th feature from the most similar pattern and substitute it in the j-th position of the pattern you want to fix.
  • K-Nearest Neighbours.
    This is similar to hot-decking, but instead of considering only the most similar pattern, you consider the K most similar patterns that have a value in the j-th feature. Then take the most frequent item (mode) amongst the j-th feature of these K patterns.
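
The first three techniques can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy rows are made up, `None` is assumed to mark a missing value, and the function names (`remove_incomplete`, `impute_mode`, `impute_conditional_mode`) are hypothetical helpers, not part of any library.

```python
from collections import Counter

# Made-up toy dataset: each row is a pattern, None marks a missing value.
# Columns: (age, gender, label).
rows = [
    (23, "M", "yes"),
    (31, "F", "no"),
    (45, None, "no"),
    (52, "F", "no"),
    (38, "M", "yes"),
    (29, "M", "yes"),
    (41, None, "yes"),
]


def remove_incomplete(rows):
    """Pattern removal: keep only patterns without any missing value."""
    return [r for r in rows if None not in r]


def column_mode(rows, j):
    """Mode of column j, computed over non-missing values only.

    Raises IndexError if column j has no non-missing value in `rows`.
    """
    values = [r[j] for r in rows if r[j] is not None]
    return Counter(values).most_common(1)[0][0]


def impute_mode(rows, j):
    """Mode approach: fill every missing value in column j with the column mode."""
    fill = column_mode(rows, j)
    return [r if r[j] is not None else r[:j] + (fill,) + r[j + 1:] for r in rows]


def impute_conditional_mode(rows, j, label_idx):
    """Conditional mode: use the mode among patterns sharing the same label."""
    fixed = []
    for r in rows:
        if r[j] is None:
            same_label = [s for s in rows if s[label_idx] == r[label_idx]]
            r = r[:j] + (column_mode(same_label, j),) + r[j + 1:]
        fixed.append(r)
    return fixed


complete_only = remove_incomplete(rows)
mode_filled = impute_mode(rows, 1)
conditional_filled = impute_conditional_mode(rows, 1, 2)
```

On this toy data the plain mode fills both gaps with "M", while the conditional mode fills the "no"-labelled pattern with "F", because every known gender among the "no" patterns is "F". For a continuous column the same structure would apply with the mean (for example `statistics.mean`) in place of the mode.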

The K value for K-Nearest Neighbours can be found by cross-validation, set a priori, or chosen with the rule-of-thumb value (K = square root of the number of instances).
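
As a small illustration of the rule of thumb, the helper below computes K from the number of instances. The name `rule_of_thumb_k` is hypothetical, and rounding to the nearest integer with a floor of 1 is an assumption added here, since the square root is rarely a whole number.

```python
import math


def rule_of_thumb_k(n_instances):
    """K = square root of the number of instances.

    Rounding to the nearest integer and never going below 1 are
    assumptions of this sketch, not part of the rule itself.
    """
    return max(1, round(math.sqrt(n_instances)))


k_small = rule_of_thumb_k(7)
k_large = rule_of_thumb_k(100)
```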

The dissimilarity measure is actually up to you, but a common choice is the HEOM (Heterogeneous Euclidean Overlap Metric), which can be found here (Section 2.3). This dissimilarity measure works well on datasets with many missing values, since it also lets you compare patterns that have missing values themselves (obviously not in the feature you want to estimate).
It is indeed important to discard patterns that are missing a value in the feature to be imputed: if your dissimilarity measure returns a most similar pattern that is also missing a value in feature j, you are basically substituting a missing value with another missing value. Pointless. This example is about hot-decking, but the same concept extends to the K most similar patterns in K-nearest neighbours (i.e. the unlucky case in which the most frequent item amongst the j-th feature of the K most similar patterns is itself a missing value).
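
Hot-decking and K-nearest-neighbour imputation with an HEOM-style dissimilarity might look like the sketch below. It assumes the usual HEOM definition (per-attribute distance 1 when either value is missing, overlap distance for categorical attributes, range-normalised absolute difference for numeric ones, combined as a Euclidean norm). The toy rows, the use of `None` for missing values, and the names `heom`, `numeric_ranges_of` and `knn_impute` are all made up for illustration.

```python
import math
from collections import Counter


def heom(x, y, numeric_ranges, skip=None):
    """HEOM-style dissimilarity between two patterns.

    numeric_ranges maps a numeric column index to (max - min) of that column;
    any other column is treated as categorical. Column `skip` is ignored.
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if a == skip:
            continue
        if xa is None or ya is None:
            d = 1.0  # maximal distance when either value is missing
        elif a in numeric_ranges:
            rng = numeric_ranges[a]
            d = abs(xa - ya) / rng if rng else 0.0
        else:
            d = 0.0 if xa == ya else 1.0  # overlap distance
        total += d * d
    return math.sqrt(total)


def numeric_ranges_of(rows, numeric_cols):
    """Range (max - min) of each numeric column, ignoring missing values."""
    ranges = {}
    for a in numeric_cols:
        values = [r[a] for r in rows if r[a] is not None]
        ranges[a] = max(values) - min(values)
    return ranges


def knn_impute(rows, i, j, k, numeric_cols):
    """Estimate categorical feature j of rows[i] as the mode among its k
    nearest donors. With k=1 this reduces to hot-decking."""
    ranges = numeric_ranges_of(rows, numeric_cols)
    # Discard donors that are themselves missing feature j.
    donors = [r for idx, r in enumerate(rows) if idx != i and r[j] is not None]
    donors.sort(key=lambda r: heom(rows[i], r, ranges, skip=j))
    votes = Counter(r[j] for r in donors[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data. Columns: (age, height_cm, gender).
rows = [
    (23, 180, "M"),
    (31, 165, "F"),
    (45, 162, None),  # pattern to fix
    (52, 160, "F"),
    (38, None, "M"),  # missing height, still usable as a donor
    (29, 178, "M"),
]

hot_deck_value = knn_impute(rows, 2, 2, 1, numeric_cols=[0, 1])
knn_value = knn_impute(rows, 2, 2, 3, numeric_cols=[0, 1])
```

Because donors missing feature j are filtered out before ranking, the vote can never return a missing value. The fifth row shows the other point made above: a donor with a missing height is still compared, it is simply penalised with the maximal per-attribute distance.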
