How to efficiently extrapolate missing data for multiple variables
Question
I have panel data and numerous variables are missing observations before certain years. The years vary across variables. What is an efficient way to extrapolate for missing data points across multiple columns? I'm thinking of something as simple as extrapolation from a linear trend, but I'm hoping to find an efficient way to apply the prediction to multiple columns. Below is a sample data set with missingness similar to what I'm dealing with. In this example, I'm hoping to fill in the NA values in the "National GDP" and "National Life Expectancy" variables using a linear trend calculated with the observed data points in each column.
library(data.table)
### Simulate national GDP values
set.seed(42)
nat_gdp <- c(replicate(20L, {
  foo <- rnorm(3, mean = 2000, sd = 300) + c(0, 1000, 2000)
  c(NA, NA, foo)
}))
### Simulate national life expectancy values
nat_life <- c(replicate(20L, {
  foo <- rnorm(2, mean = 55, sd = 7.8) + c(0, 1.5)
  c(NA, NA, NA, foo)
}))
### Construct the data.table (Year is recycled across the 100 rows)
data.sim <- data.table(
  GovernorateID     = rep(seq.int(11L, 15L, by = 1L), each = 20),
  DistrictID        = rep(seq.int(1100, 1500, by = 100), each = 20) + rep(seq_len(4), each = 5),
  Year              = seq.int(1990, 1994, by = 1L),
  National_gdp      = nat_gdp,
  National_life_exp = nat_life
)
Recommended answer
I assume that you want to fit the linear model on each DistrictID separately.
The original data table:
head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990           NA                NA
## 2:            11       1101 1991           NA                NA
## 3:            11       1101 1992     1988.746                NA
## 4:            11       1101 1993     2527.619          54.70739
## 5:            11       1101 1994     3854.210          44.21809
## 6:            11       1102 1990           NA                NA
dd <- copy(data.sim) # Make a copy for later.
Replace the NA elements in each column with the prediction of a linear model fitted per DistrictID. Both steps are combined in one chained operation.
data.sim[, National_life_exp := ifelse(is.na(National_life_exp),
                                       predict(lm(National_life_exp ~ Year, data = .SD), .SD),
                                       National_life_exp),
         by = DistrictID
][, National_gdp := ifelse(is.na(National_gdp),
                           predict(lm(National_gdp ~ Year, data = .SD), .SD),
                           National_gdp),
  by = DistrictID
]
head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990    -8.004377          86.17531
## 2:            11       1101 1991   924.727559          75.68601
## 3:            11       1101 1992  1988.745871          65.19670
## 4:            11       1101 1993  2527.618676          54.70739
## 5:            11       1101 1994  3854.209743          44.21809
## 6:            11       1102 1990  1008.886661          70.45643
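The chained call above repeats the same expression once per column. As a sketch of how the same idea might scale to many columns, the snippet below wraps the fit in a helper and applies it to every listed column through `.SDcols`. The helper name `fill_linear` and the toy table `toy` are made up for illustration and are not part of the original answer. The guard that skips groups with fewer than two observed points is likewise an added assumption about the desired behaviour.

```r
library(data.table)

# Hypothetical helper (not from the original answer): fill the NA entries of y
# with predictions from a linear trend y ~ x fitted to the observed entries.
fill_linear <- function(y, x) {
  missing <- is.na(y)
  # Assumption: leave y unchanged if nothing is missing or if fewer than
  # two observed points are available to fit a line.
  if (!any(missing) || sum(!missing) < 2L) return(y)
  fit <- lm(y ~ x)  # lm() drops rows with NA by default
  y[missing] <- predict(fit, newdata = data.frame(x = x[missing]))
  y
}

# Toy data, made up for illustration: two groups, two columns with leading NAs.
toy <- data.table(
  id   = rep(c("a", "b"), each = 5L),
  Year = rep(1990:1994, times = 2L),
  v1   = c(NA, NA, 3, 4, 5, NA, 20, 30, 40, 50),
  v2   = c(NA, NA, NA, 8, 6, NA, NA, 1, 1, 1)
)

# Columns to extrapolate; every column is filled within each group.
cols <- c("v1", "v2")
toy[, (cols) := lapply(.SD, fill_linear, x = Year), by = id, .SDcols = cols]
print(toy)
```

Applied to the question's data, this would presumably read `cols <- c("National_gdp", "National_life_exp")` followed by `data.sim[, (cols) := lapply(.SD, fill_linear, x = Year), by = DistrictID, .SDcols = cols]`, so adding another variable only means extending `cols`.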
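A nice (?) plot: black points are the filled series, red points are the original observations. Note that each level of DistrictID has exactly two non-NA National_life_exp points in this example, so each fitted line passes exactly through its observed points.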
plot(data.sim$National_life_exp)
points(dd$National_life_exp, col='red') # The copy from before.
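To illustrate what the two-point case means for the extrapolated values, the following base R sketch recomputes the life expectancy of DistrictID 1101 by hand from the two observed values printed above (rounded as displayed) and compares the result with `lm()`. The variable names are illustrative only.

```r
# Observed National_life_exp for DistrictID 1101 (rounded values from the printout).
year_obs <- c(1993, 1994)
life_obs <- c(54.70739, 44.21809)

# Line through the two points: slope, then back-extrapolation to 1990-1992.
slope <- diff(life_obs) / diff(year_obs)
year_new <- 1990:1992
by_hand <- life_obs[1] + slope * (year_new - year_obs[1])

# The same prediction from a linear model fitted to the two points.
fit <- lm(life ~ year, data = data.frame(year = year_obs, life = life_obs))
by_lm <- unname(predict(fit, newdata = data.frame(year = year_new)))

print(round(by_hand, 4))
print(isTRUE(all.equal(by_hand, by_lm, tolerance = 1e-6)))
```

Up to rounding of the inputs, the hand-computed values agree with the 86.18, 75.69, and 65.20 shown for 1990 to 1992 in the filled table. With only two observations the trend is fully determined by those two points, so noise in either one carries straight into every extrapolated year.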