Fill missing values based on probability of occurrence
Question
This is what my data.table/data.frame looks like:
library(data.table)
dt <- fread('
STATE ZIP
PA 19333
PA 19327
PA 19333
PA NA
PA 19355
PA 19333
PA NA
PA 19355
PA NA
')
I have three missing values in the ZIP column. I want to fill them with values sampled from the non-missing ZIPs, drawn according to how often each ZIP occurs in the dataset. For example, among the six non-missing PA rows, ZIP 19333 occurs three times, ZIP 19355 occurs twice, and 19327 occurs once. So for PA, 19333 has a 50% probability of being drawn, 19355 has a 33.33% chance, and 19327 has a 16.67% chance. 19333 is therefore the most likely pick when filling the three missing ZIPs. The final filled dataset might look like the following, where two missing values were filled with '19333' and one with '19355':
STATE ZIP
PA 19333
PA 19327
PA 19333
PA 19333
PA 19355
PA 19333
PA 19333
PA 19355
PA 19355
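As a quick check of the percentages quoted above, the empirical share of each ZIP can be computed from the non-missing values. This is only a minimal sketch in base R; the vector zip is an assumption for illustration, holding the six observed PA values from the example typed out by hand.

```r
# Observed (non-missing) ZIPs for PA, copied by hand from the example above
zip <- c(19333, 19327, 19333, 19355, 19333, 19355)

# Count each ZIP, then divide by the number of observed values
probs <- prop.table(table(zip))

# 19327 is 1 of 6, 19333 is 3 of 6, 19355 is 2 of 6
print(round(probs, 4))
```

These shares are the probabilities with which each ZIP should be drawn when filling a missing value for PA.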
I have more than one STATE in my dataset. The main idea is to fill in missing ZIPs based on the probability of a ZIP occurring within a given STATE.
Recommended answer
Here's a way using just sample, wrapped up in a convenience function. Sampling with replacement from the non-missing values draws each ZIP in proportion to how often it occurs.
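sample_fill_na = function(x) {
  x_na = is.na(x)
  obs = x[!x_na]
  # Nothing to fill, or nothing to sample from: return x unchanged
  if (!any(x_na) || length(obs) == 0L) return(x)
  # Sample positions, not values: sample(x, ...) treats a single number x as 1:x
  x[x_na] = obs[sample(length(obs), size = sum(x_na), replace = TRUE)]
  return(x)
}
# Fill within each STATE so draws come only from that state's observed ZIPs
dt[, ZIP := sample_fill_na(ZIP), by = STATE]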
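Sampling with replacement from the observed values already draws each ZIP in proportion to its frequency. If it helps to see the probabilities spelled out, the same idea can be sketched with explicit weights passed to sample.int. This is an illustrative variant, not the answer's original code: the function name fill_by_prob, the data frame df, and the DE rows are made up for the example, and it uses base R's ave in place of the data.table grouping.

```r
set.seed(42)  # arbitrary seed, only so the draws are repeatable

# Hypothetical two-state data; the DE rows are invented for illustration
df <- data.frame(
  STATE = c("PA", "PA", "PA", "PA", "DE", "DE", "DE"),
  ZIP   = c(19333L, 19333L, 19355L, NA, 19701L, NA, 19701L)
)

fill_by_prob <- function(x) {
  x_na <- is.na(x)
  obs <- x[!x_na]
  # Leave the group unchanged if nothing is missing or nothing is observed
  if (!any(x_na) || length(obs) == 0L) return(x)
  vals <- unique(obs)                    # distinct observed values
  counts <- tabulate(match(obs, vals))   # how often each one occurs
  # Draw positions in vals, weighted by relative frequency
  idx <- sample.int(length(vals), size = sum(x_na), replace = TRUE,
                    prob = counts / length(obs))
  x[x_na] <- vals[idx]
  x
}

# Apply the fill separately within each STATE
df$ZIP <- ave(df$ZIP, df$STATE, FUN = fill_by_prob)
print(df)
```

With a data.table, the same function could be applied as dt[, ZIP := fill_by_prob(ZIP), by = STATE]. Because the fill is random, results differ between runs unless a seed is set.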