Fill missing value based on probability of occurrence

Problem description

This is what my data.table/dataframe looks like:

library(data.table)
dt <- fread('
   STATE     ZIP      
   PA        19333        
   PA        19327        
   PA        19333        
   PA        NA        
   PA        19355
   PA        19333
   PA        NA
   PA        19355
   PA        NA     
')

I have three missing values in the ZIP column. I want to fill them with values sampled from the non-missing ZIPs according to their probability of occurring in the dataset. For example, ZIP 19333 occurs three times, ZIP 19355 occurs twice, and 19327 occurs once. So among the six non-missing PA rows, 19333 has a 50% probability of occurring, 19355 has a 33.33% chance, and 19327 has a 16.67% chance. 19333 therefore has the highest probability of being picked when filling the three missing ZIPs. The final filled dataset may look like the following, where two missing values were filled with '19333' and one with '19355':

       STATE     ZIP      
       PA        19333        
       PA        19327        
       PA        19333        
       PA        19333       
       PA        19355
       PA        19333
       PA        19333
       PA        19355
       PA        19355    

I have more than one STATE in my dataset. The main idea is to fill in missing ZIPs based on the probability of a ZIP occurring for a given STATE.
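
The per-state probabilities described above can be checked directly. The following is a minimal sketch using the sample data from the question; the names `zip_probs`, `N`, and `prob` are chosen here for illustration and do not appear in the original post.

```r
library(data.table)

dt <- fread('
STATE ZIP
PA 19333
PA 19327
PA 19333
PA NA
PA 19355
PA 19333
PA NA
PA 19355
PA NA
')

# Count each observed (non-missing) ZIP within its STATE ...
zip_probs <- dt[!is.na(ZIP), .N, by = .(STATE, ZIP)]
# ... then turn the counts into within-state proportions.
zip_probs[, prob := N / sum(N), by = STATE]

print(zip_probs)
```

With more than one STATE in the data, the `by = STATE` grouping keeps each state's proportions separate, which is the distribution the missing ZIPs should be drawn from.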

Recommended answer

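Here's a way using just sample, wrapped up in a convenience function. Because it draws uniformly, with replacement, from the non-missing values of a group, each ZIP is picked in proportion to how often it occurs in that group.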

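sample_fill_na = function(x) {
    x_na = is.na(x)
    obs = x[!x_na]
    # Nothing to fill, or no observed values to draw from: return x unchanged.
    if (!any(x_na) || length(obs) == 0L) return(x)
    # Sample positions, not values: sample() on a single number n draws from 1:n.
    x[x_na] = obs[sample.int(length(obs), size = sum(x_na), replace = TRUE)]
    return(x)
}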

dt[, ZIP := sample_fill_na(ZIP), by = STATE]
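
The sketch below is a self-contained run of this approach on more than one STATE. It defines a position-indexing version of `sample_fill_na` so that it runs on its own; sampling positions matters because `sample()` given a single number n draws from 1:n, which would give wrong ZIPs for a state with only one observed value. The NY rows and the seed are made up for illustration and are not part of the original question.

```r
library(data.table)

sample_fill_na <- function(x) {
    x_na <- is.na(x)
    obs <- x[!x_na]
    # Leave x alone if nothing is missing or nothing is observed.
    if (!any(x_na) || length(obs) == 0L) return(x)
    # Draw positions with replacement so frequent values are picked more often.
    x[x_na] <- obs[sample.int(length(obs), size = sum(x_na), replace = TRUE)]
    x
}

# PA rows are from the question; NY rows are hypothetical and have
# a single observed ZIP to exercise the length-one case.
dt <- data.table(
    STATE = c(rep("PA", 9), rep("NY", 3)),
    ZIP   = c(19333L, 19327L, 19333L, NA, 19355L, 19333L, NA, 19355L, NA,
              10001L, NA, NA)
)

set.seed(42)  # arbitrary seed, only so repeated runs give the same fill
dt[, ZIP := sample_fill_na(ZIP), by = STATE]

print(dt)
```

Each missing ZIP is filled only from values observed in the same STATE, so the two missing NY rows can only become 10001. The fill is random, so different seeds give different PA results.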
