How to efficiently extrapolate missing data for multiple variables


Question

I have panel data in which many variables have no observations before a certain year, and that year differs from variable to variable. What is an efficient way to extrapolate the missing data points across multiple columns? Something as simple as extrapolating from a linear trend would do, but I would like an efficient way to apply the prediction to many columns at once. Below is a sample data set with missingness similar to what I am dealing with. In this example, I want to fill in the NA values in the National_gdp and National_life_exp variables using a linear trend calculated from the observed data points in each column.

library(data.table)

# Simulate national GDP values: per district, two missing years, then three observed
set.seed(42)
nat_gdp <- c(replicate(20L, {
  foo <- rnorm(3, mean = 2000, sd = 300) + c(0, 1000, 2000)
  c(NA, NA, foo)
}))

# Simulate national life expectancy values: three missing years, then two observed
nat_life <- c(replicate(20L, {
  foo <- rnorm(2, mean = 55, sd = 7.8) + c(0, 1.5)
  c(NA, NA, NA, foo)
}))

# Construct the data.table: 20 districts x 5 years (1990-1994)
data.sim <- data.table(
  GovernorateID     = rep(seq.int(11L, 15L, by = 1L), each = 20),
  DistrictID        = rep(seq.int(1100, 1500, by = 100), each = 20) + rep(seq_len(4), each = 5),
  Year              = rep(seq.int(1990, 1994, by = 1L), times = 20),  # explicit recycling
  National_gdp      = nat_gdp,
  National_life_exp = nat_life
)

Recommended answer

I assume that you want to fit the linear model on each DistrictID separately.

The original data table:

head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990           NA                NA
## 2:            11       1101 1991           NA                NA
## 3:            11       1101 1992     1988.746                NA
## 4:            11       1101 1993     2527.619          54.70739
## 5:            11       1101 1994     3854.210          44.21809
## 6:            11       1102 1990           NA                NA

dd <- copy(data.sim) # Make a copy for later.
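
The copy matters because data.table's := operator modifies a table by reference, so a plain assignment such as dd <- data.sim would not preserve the original values. A minimal sketch of the difference, using a toy table whose names (a, alias, snapshot) are illustrative and not part of the original answer:

```r
library(data.table)

a <- data.table(x = c(1, NA))   # toy table, illustrative only
alias    <- a                   # plain assignment: same underlying object
snapshot <- copy(a)             # copy(): independent object

a[is.na(x), x := 0]             # update by reference

anyNA(alias$x)                  # the alias sees the update
anyNA(snapshot$x)               # the copy still holds the NA
```

This is why the plot at the end of the answer can compare the filled-in columns against dd.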

Replace the NA elements in each column with the predictions of a linear model fitted per DistrictID. The two columns are handled as two steps in one chained operation.

data.sim[, National_life_exp := ifelse(is.na(National_life_exp), 
                                       predict(lm(National_life_exp ~ Year, data=.SD), .SD),
                                       National_life_exp)
         , by=DistrictID
         ][, National_gdp := ifelse(is.na(National_gdp),
                                    predict(lm(National_gdp ~ Year, data=.SD), .SD),
                                    National_gdp) 
           , by=DistrictID
        ]


head(data.sim)
##    GovernorateID DistrictID Year National_gdp National_life_exp
## 1:            11       1101 1990    -8.004377          86.17531
## 2:            11       1101 1991   924.727559          75.68601
## 3:            11       1101 1992  1988.745871          65.19670
## 4:            11       1101 1993  2527.618676          54.70739
## 5:            11       1101 1994  3854.209743          44.21809
## 6:            11       1102 1990  1008.886661          70.45643

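A nice (?) plot of the filled-in series, with the originally observed values overlaid in red. Note that each level of DistrictID has exactly two non-NA points for National_life_exp in this example, so each fitted line passes exactly through both observed points.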

plot(data.sim$National_life_exp)
points(dd$National_life_exp, col='red') # The copy from before.
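
The chained operation above repeats the same expression once per column. With many columns, one possible generalization is to wrap the fit in a small helper and apply it through .SDcols. The sketch below is an assumption, not part of the original answer: fill_linear is a hypothetical helper name, and the toy table uses exactly linear values so the result is easy to check by hand.

```r
library(data.table)

# Hypothetical helper (not from the original answer): fill the NAs in y
# with predictions from a linear fit of the observed y on x.
fill_linear <- function(y, x) {
  ok <- !is.na(y)
  # Nothing to fill, or too few observed points to define a trend
  if (all(ok) || sum(ok) < 2L) return(y)
  fit <- lm(y ~ x)              # lm() drops NA rows by default
  y[!ok] <- predict(fit, newdata = data.frame(x = x[!ok]))
  y
}

# Toy data with exactly linear observed values (illustrative only)
dt <- data.table(
  DistrictID        = rep(c(1101L, 1102L), each = 5L),
  Year              = rep(1990:1994, times = 2L),
  National_gdp      = c(NA, NA, 300, 400, 500, NA, NA, 30, 50, 70),
  National_life_exp = c(NA, NA, NA, 60, 61, NA, NA, NA, 50, 52)
)

# Apply the same fill to every listed column, separately per district
cols <- c("National_gdp", "National_life_exp")
dt[, (cols) := lapply(.SD, fill_linear, x = Year),
   by = DistrictID, .SDcols = cols]
```

Adding another variable then only means extending cols. As the output above shows for National_gdp in 1990, a linear trend can extrapolate to implausible values (a negative GDP), so the filled values are worth inspecting before use.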
