Fill in missing values in a data.table using the growth rate by category


Problem description

I have incomplete (time) series in which I would like to fill in missing values using the most recent available values and the growth rates from another series, by category (country). Neither the categories nor the runs of missing values are of equal length. This requires applying a function to a variable sequentially: first I need to take the last available data point (which can be anywhere in the series) and divide it by 1 + growth rate, then move to the next data point and do the same.

Example dataset and desired outcome:

require(data.table)
DT_desired<-data.table(category=c(rep("A",4),rep("B",4)),
           year=2010:2013,
           grwth=c(NA,.05,0.1,0,NA,0.1,0.15,0.2))
DT_desired[,values:=c(cumprod(c(1,DT_desired[category=="A"&!is.na(grwth),grwth]+1)),cumprod(c(1,DT_desired[category=="B"&!is.na(grwth),grwth]+1)))]

DT_example <- copy(DT_desired)[c(1,2,3,5),values:=NA]
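
With the example data above, the back-filling rule described in the question can be written as a recurrence that steps one year back at a time. This only restates that description; the symbols $v_t$ (value in year $t$) and $g_t$ (growth rate in year $t$) are notation introduced here, not names from the original post. Worked through for category A, it reproduces the `values` column of `DT_desired`:

```latex
v_t = \frac{v_{t+1}}{1 + g_{t+1}}

\begin{aligned}
v_{2012} &= \frac{v_{2013}}{1 + g_{2013}} = \frac{1.155}{1 + 0}    = 1.155 \\
v_{2011} &= \frac{v_{2012}}{1 + g_{2012}} = \frac{1.155}{1 + 0.10} = 1.05  \\
v_{2010} &= \frac{v_{2011}}{1 + g_{2011}} = \frac{1.05}{1 + 0.05}  = 1
\end{aligned}
```

Each step needs the result of the step before it, which is why the fill has to be applied sequentially.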

What I tried: you can do it with a for loop, but that is inefficient and discouraged in R. I have come to like the efficiency of data.table, and I would prefer to do it that way. I have tried data.table's shift function, which only fills one missing value per category (which is logical, I guess, since the whole expression is evaluated at once, while the remaining rows are still missing the value they depend on).

DT_example[,values:=ifelse(is.na(values),shift(values,type = "lead")/(1+shift(grwth,type = "lead")),values),by=category]
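
Why a single vectorised pass fills only one gap can be illustrated outside R. The sketch below is plain C++ rather than data.table code, and `leadPass` and `countMissing` are hypothetical helper names made up for this illustration. It assumes missing entries are stored as NaN, and it computes every output element from the original vectors, which mimics how the `shift()` expression above is evaluated in one go.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One simultaneous "lead" pass: every output element is computed from the
// ORIGINAL input vectors, never from values filled earlier in the same pass.
std::vector<double> leadPass(const std::vector<double>& values,
                             const std::vector<double>& grwth) {
  assert(values.size() == grwth.size());
  std::vector<double> out(values);
  for (std::size_t i = 0; i + 1 < values.size(); ++i) {
    if (std::isnan(values[i])) {
      // Stays NaN whenever the following value is itself still missing.
      out[i] = values[i + 1] / (1.0 + grwth[i + 1]);
    }
  }
  return out;
}

// Count the entries that are still missing (NaN).
std::size_t countMissing(const std::vector<double>& v) {
  std::size_t missing = 0;
  for (double x : v) {
    if (std::isnan(x)) ++missing;
  }
  return missing;
}
```

Applied to category A of `DT_example` (three missing values before 1.155), one call to `leadPass` leaves two values missing, and only the third repeated call fills the series completely. A sequential back-to-front loop gets the same result in a single pass.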

I gather from other posts that you can probably do it with the rollapply function of the zoo package, but I have the feeling that I should be able to do it in data.table without yet another package, and that the solution is relatively simple and elegant; I am just not experienced enough to find it.

This may very well be a duplicate, and I apologize if I missed the relevant post, but none of what I found did exactly what I want.

Recommended answer

Not sure if this has been solved outside of SO, but it caught my eye the other day. I hadn't written Rcpp in a long time and figured this would be good practice. I know you were looking for a native data.table solution, so feel free to take it or leave it:

Contents of the foo.cpp file:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fillValues(NumericVector vals, NumericVector gRates){

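  int n = vals.size();
  if(gRates.size() != n){
    Rcpp::stop("vals and gRates must have the same length");
  }
  NumericVector out(n);
  if(n == 0) return out;  // nothing to fill, and vals[n - 1] would be out of range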

  double currentValue   = vals[n - 1];
  double currentGrowth  = gRates[n - 1];

  // initial assignment
  out[n - 1] = currentValue;

  for(int i = n - 2; i >= 0; i--){

    if(NumericVector::is_na(vals[i])){
      // vals[i] is missing, so derive it from the next (later) value
      if(NumericVector::is_na(currentValue) || NumericVector::is_na(currentGrowth)){
        // A NaN is truthy in C++, so test for missingness explicitly;
        // without a known later value and growth rate we cannot fill vals[i]
        Rcpp::stop("Missing value or growth rate where an actual value is needed");
      } else {
        // Step one period back: v[i] = v[i+1] / (1 + g[i+1])
        out[i] = currentValue / (1 + currentGrowth);
      }
    } else {
      out[i] = vals[i];
    }

    // update
    currentValue = out[i];
    if(!NumericVector::is_na(gRates[i])){
      currentGrowth = gRates[i];
    }
  }

  return out;
}

/*** R
require(data.table)
DT_desired<-data.table(category=c(rep("A",4),rep("B",4)),
                       year=2010:2013,
                       grwth=c(NA,.05,0.1,0,NA,0.1,0.15,0.2))

DT_desired[,values:=c(cumprod(c(1,DT_desired[category=="A"&!is.na(grwth),grwth]+1)),cumprod(c(1,DT_desired[category=="B"&!is.na(grwth),grwth]+1)))]

DT_example <- copy(DT_desired)[c(1,2,3,5),values:=NA]

DT_desired[]
DT_example[]

DT_example[, values := fillValues(values, grwth), by = category][]
*/

Then run it:

> Rcpp::sourceCpp('foo.cpp')

# Removed output that created example data

> DT_desired[]
   category year grwth values
1:        A 2010    NA  1.000
2:        A 2011  0.05  1.050
3:        A 2012  0.10  1.155
4:        A 2013  0.00  1.155
5:        B 2010    NA  1.000
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

> DT_example[]
   category year grwth values
1:        A 2010    NA     NA
2:        A 2011  0.05     NA
3:        A 2012  0.10     NA
4:        A 2013  0.00  1.155
5:        B 2010    NA     NA
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

> DT_example[, values := fillValues(values, grwth), by = category][]
   category year grwth values
1:        A 2010    NA  1.000
2:        A 2011  0.05  1.050
3:        A 2012  0.10  1.155
4:        A 2013  0.00  1.155
5:        B 2010    NA  1.000
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

Note that this runs back to front, so it assumes you want to begin with the most recent recording and work towards recordings from further back. It also assumes your dataset is sorted.
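
To make the two assumptions explicit (sorted input, back-to-front filling), here is a minimal standalone sketch in plain C++ without Rcpp. It is not the answer's `fillValues`: the `Row` struct and `fillByCategory` are hypothetical names introduced for illustration, and the sketch assumes missing entries are NaN and that each gap should use the growth rate of the row immediately after it. It sorts by category and year first, never carries a value across a category boundary, and leaves a gap as NaN when no later value is available.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical row type mirroring the columns of DT_example.
struct Row {
  std::string category;
  int year;
  double grwth;   // NaN where missing
  double values;  // NaN where missing
};

// Sort by (category, year), then back-fill missing values within each category.
void fillByCategory(std::vector<Row>& rows) {
  std::sort(rows.begin(), rows.end(), [](const Row& a, const Row& b) {
    return std::tie(a.category, a.year) < std::tie(b.category, b.year);
  });
  // Walk from the last row to the first, looking at pairs (i - 1, i).
  for (std::size_t i = rows.size(); i-- > 1;) {
    Row& earlier = rows[i - 1];
    const Row& later = rows[i];
    if (earlier.category != later.category) continue;  // never cross categories
    if (std::isnan(earlier.values) && !std::isnan(later.values) &&
        !std::isnan(later.grwth)) {
      // v[t] = v[t+1] / (1 + g[t+1])
      earlier.values = later.values / (1.0 + later.grwth);
    }
  }
}
```

Because the rows are sorted inside the function, the caller can pass them in any order. With the rows of `DT_example` as input, the filled values match `DT_desired`.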
