Replace missing values in categorical data

Question

Suppose I have a column of categorical data with the values "red", "green", and "blue", plus some empty cells:

red
green
red
blue
NaN

I'm sure that each NaN is one of red, green, or blue. Should I replace the NaN with the average of the colors, or is that too strong an assumption? After one-hot encoding (col1 = red, col2 = green, col3 = blue), it would be:

col1 | col2 | col3
  1      0     0
  0      1     0
  1      0     0
  0      0     1
 0.5    0.25  0.25
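
For reference, the last row above is the column-wise mean of the one-hot rows, i.e. the observed frequency of each color. A minimal pandas sketch of that arithmetic, assuming the column is a pandas Series (the variable names are illustrative and not from the original post):

```python
import pandas as pd

# The example column from the question; None marks the empty cell.
colors = pd.Series(["red", "green", "red", "blue", None], name="color")

# One-hot encode. By default, rows with a missing value become all zeros.
one_hot = pd.get_dummies(colors, dtype=float)[["red", "green", "blue"]]

# Column-wise mean over the observed rows gives the category frequencies.
observed = colors.notna()
freq = one_hot[observed].mean()

# Write those frequencies into every row whose color was missing.
one_hot.loc[~observed, :] = freq.values

print(one_hot)
```

Scaling that row by a constant, as proposed next, would just multiply `freq.values` by a factor such as 0.5.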

Or should I even scale the last row down while keeping the ratio, so that these values have less influence?

 0.25  0.125  0.125

What is the usual best practice?

Recommended answer

It depends on what you want to do with the data. Is the average of these colors useful for your purpose? By averaging, you create a new possible value, which is probably not what you want, especially since this is categorical data and you are treating it as if it were numeric.

In machine learning, you would typically replace the missing values with the most common category value with respect to the target attribute (what you want to predict).

Example: you want to predict whether a person is male or female by looking at their car, and the color feature has some missing values. If most cars of male drivers are blue and most cars of female drivers are red, you would fill the missing entries with blue for male drivers and with red for female drivers.
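
The example above could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the toy data, the column names `gender` and `color`, and the helper `fill_with_group_mode` are all made up here and are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: "gender" is the target, "color" has missing entries.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "male", "female"],
    "color":  ["blue", "blue", "red", "red", "red", "green", None, None],
})

def fill_with_group_mode(frame, feature, target):
    """Fill missing `feature` values with the most frequent value in each `target` group.

    Assumes every target group has at least one observed value of `feature`;
    ties are broken by taking the first mode in sorted order.
    """
    # Most frequent non-missing value per target class (mode() ignores NaN).
    modes = frame.groupby(target)[feature].agg(lambda s: s.mode().iloc[0])
    # Look up each row's class mode and use it only where the feature is missing.
    filled = frame[feature].fillna(frame[target].map(modes))
    return frame.assign(**{feature: filled})

imputed = fill_with_group_mode(df, "color", "gender")
print(imputed)
```

Because the fill depends on the target, this sketch only applies to rows where the target is known, such as training data.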
