R: filling up data gaps with NAs and applying cumsum function


Problem description

I was asked to break down my earlier question (R: Applying cumulative sum function and filling data gaps with NA for plotting) and post a smaller sample. Here it is; the sample data is also available at: https://dl.dropboxusercontent.com/u/16277659/inputdata.csv

NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE
SAMPLE1;    253;    1883;           1883;           0
SAMPLE1;    253;    1884;           1883;           NA
SAMPLE1;    253;    1885;           1884;           12
SAMPLE1;    253;    1890;           1889;           17
SAMPLE2;    261;    1991;           1991;           0
SAMPLE2;    261;    1992;           1991;           -19
SAMPLE2;    261;    1994;           1992;           -58
SAMPLE2;    261;    1995;           1994;           -40

I would like to calculate the cumulative sum of the column VALUE and fill the gaps for the missing years in between with NA values. The structure of the data should stay the same, as I need the other columns for further processing.

When filling the data gaps, NAs should be inserted as in SAMPLE1. Please note where the value goes when a run of several NAs is filled in the column CUMSUM: the last known CUMSUM value should be written next to the last NA in VALUE (this is needed for plotting).

The exception is when the period between REFERENCE_YEAR and SURVEY_YEAR is greater than one year. In that case the value should also be written into the gap row, as in SAMPLE2 for the period 1992 to 1994.

This is only a sample dataset; my actual dataset consists of several columns and about 40000 rows. A solution in base R would be best. REFERENCE_YEAR and SURVEY_YEAR are equal in the first row of each SAMPLE as a result of the code I use for writing a zero column for each group.

The desired output is:

NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE;  CUMSUM
SAMPLE1;    253;    1883;           1883;           0;      0
SAMPLE1;    253;    1884;           1883;           NA;     NA
SAMPLE1;    253;    1885;           1884;           12;     12
SAMPLE1;    253;    1886;           1885;           NA;     NA
SAMPLE1;    253;    1887;           1886;           NA;     NA
SAMPLE1;    253;    1888;           1887;           NA;     NA
SAMPLE1;    253;    1889;           1888;           NA;     12
SAMPLE1;    253;    1890;           1889;           17;     29
SAMPLE2;    261;    1991;           1991;           0;      0
SAMPLE2;    261;    1992;           1991;           -19;    -19
SAMPLE2;    261;    1993;           1992;           -58;    -77
SAMPLE2;    261;    1994;           1992;           -58;    -77
SAMPLE2;    261;    1995;           1994;           -40;    -117

--------------------------------------------------------------------------------------------


Solution

If dat is the dataset, one way would be:
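
To follow the steps without downloading the CSV, dat can be built from the sample rows shown in the question. This is a minimal sketch that assumes the semicolon-separated layout of the question; the inline text input (txt) is only for illustration.

```r
# Sample rows copied from the question (semicolon-separated, padded with spaces)
txt <- "NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40"

# strip.white = TRUE removes the padding around the fields;
# "NA" is read as a missing value by default
dat <- read.table(text = txt, sep = ";", header = TRUE,
                  strip.white = TRUE, stringsAsFactors = FALSE)
str(dat)
```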

Create a new dataset dat1 that contains every year between the minimum and maximum SURVEY_YEAR for each NAME

 dat1 <- setNames(stack(
             with(dat, tapply(SURVEY_YEAR, NAME, 
                FUN=function(x) seq(min(x), max(x)))))[2:1], c("NAME", "SURVEY_YEAR"))
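
The same expansion can also be written with split and lapply, which may be easier to follow than tapply plus stack. A minimal sketch on a toy input; toy and years are hypothetical names, not part of the answer.

```r
toy <- data.frame(NAME = c("SAMPLE1", "SAMPLE1", "SAMPLE2", "SAMPLE2"),
                  SURVEY_YEAR = c(1885, 1890, 1992, 1994))

# one full year sequence per NAME
years <- lapply(split(toy$SURVEY_YEAR, toy$NAME),
                function(x) seq(min(x), max(x)))

# flatten to one row per NAME and year
dat1 <- data.frame(NAME = rep(names(years), lengths(years)),
                   SURVEY_YEAR = unlist(years, use.names = FALSE))
dat1
```

SAMPLE1 expands to the six years 1885 to 1890 and SAMPLE2 to the three years 1992 to 1994.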

Merge the new dataset dat1 with the old dat, keeping all rows

 datN <- merge(dat1, dat, all=TRUE)
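
With all = TRUE, merge joins on the columns the two data frames share and keeps the rows that have no match, filling their remaining columns with NA. A small sketch with hypothetical toy inputs (full, obs):

```r
full <- data.frame(NAME = "SAMPLE1", SURVEY_YEAR = c(1883, 1884, 1885, 1886))
obs  <- data.frame(NAME = "SAMPLE1", SURVEY_YEAR = c(1883, 1884, 1886),
                   VALUE = c(0, NA, 12))

# joins on NAME and SURVEY_YEAR; 1885 has no match, so VALUE becomes NA there
merged <- merge(full, obs, all = TRUE)
merged
```

After the merge, the NA that was already in the data (1884) and the NA created for the gap year (1885) look the same in VALUE.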

Replace the missing values in REFERENCE_YEAR by SURVEY_YEAR from the previous row

 datN$REFERENCE_YEAR[is.na(datN$REFERENCE_YEAR)] <- datN$SURVEY_YEAR[which(is.na(datN$REFERENCE_YEAR))-1]
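
On plain vectors the indexing works as sketched below. It relies on the rows being sorted by NAME and SURVEY_YEAR and on an inserted row never being the first row of a group; the latter holds here because the expansion starts at each group's first observed year. The vectors survey and ref are illustrative only.

```r
survey <- c(1885, 1886, 1887, 1890)
ref    <- c(1884,   NA,   NA, 1889)

miss <- which(is.na(ref))
# each missing reference year becomes the survey year of the row before it
ref[miss] <- survey[miss - 1]
ref
```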

Use na.locf from zoo to fill the NAs in ID, and initialise an empty CUMSUM column

 library(zoo)
 datN$ID <- na.locf(datN$ID)
 datN$CUMSUM <- NA
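
Since the question prefers base R, carrying the last observation forward can also be sketched without zoo. locf below is a hypothetical helper, not a function from any package; leading NAs are left as NA.

```r
locf <- function(x) {
  # index of the most recent non-NA element at each position (0 if none yet)
  idx <- cummax(seq_along(x) * (!is.na(x)))
  idx[idx == 0] <- NA   # nothing to carry forward yet
  x[idx]
}

locf(c(253, NA, NA, 261, NA))
```

Applied to datN$ID this should behave like the na.locf call here, because the first row of every group carries an ID.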

Do cumsum on the non-NA VALUE rows within each NAME and store the result in CUMSUM

 datN$CUMSUM[!is.na(datN$VALUE)] <-  unlist(with(datN, tapply(VALUE, NAME, FUN=function(x) cumsum(x[!is.na(x)]))))
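
The unlist(tapply(...)) assignment relies on datN being sorted by NAME, because tapply returns the groups in level order. A sketch of a variant that does not depend on row order, using ave; cumsum_skip_na is a hypothetical helper name.

```r
cumsum_skip_na <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  out[ok] <- cumsum(x[ok])   # running total over the observed values only
  out
}

VALUE <- c(0, NA, 12, 17, 0, -19, -58, -40)
NAME  <- rep(c("SAMPLE1", "SAMPLE2"), each = 4)
ave(VALUE, NAME, FUN = cumsum_skip_na)
```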

Look for rows where the difference between SURVEY_YEAR and REFERENCE_YEAR is greater than 1

 indx <- with(datN, SURVEY_YEAR-REFERENCE_YEAR)>1

For each of those rows, copy its VALUE and CUMSUM into the row just before it (the inserted gap row)

 datN[,c("VALUE", "CUMSUM")] <- lapply(datN[,c("VALUE", "CUMSUM")], function(x) {x[which(indx)-1] <- x[indx]; x})
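
On the SAMPLE2 years the copy step looks like this. Only the single row directly before a flagged row is filled, so the sketch covers a span of two years between REFERENCE_YEAR and SURVEY_YEAR, as in the sample. The vectors are illustrative only.

```r
survey <- c(1992, 1993, 1994, 1995)
ref    <- c(1991, 1992, 1992, 1994)
value  <- c(-19,   NA,  -58,  -40)

indx <- (survey - ref) > 1            # TRUE only for the 1994 row
value[which(indx) - 1] <- value[indx] # 1993 receives the value of 1994
value
```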

Change some of the NA values in CUMSUM to the previous non-NA value, so that the last NA of the 1886 to 1889 gap in SAMPLE1 carries 12

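datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = function(x) {
    x1 <- is.na(x)
    rl <- rle(x1)
    # positions directly before each non-NA value that occurs after the first NA of the group
    indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
    # keep positions more than one step after the previous candidate
    # (the first candidate is compared with position 1)
    indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
    # for each kept position, find the last non-NA position before it
    indxn <- unlist(lapply(indx1, function(y) {
        indx2 <- which(!is.na(x))
        tail(indx2[which(indx2 < y)], 1)
    }))
    x[indx1] <- x[indxn]
    x
}))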
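
To see what this last step does, the function body can be run on the CUMSUM vector of a single group. The wrapper name fill_last_na is hypothetical; the body is copied from the ave call above.

```r
fill_last_na <- function(x) {
  x1 <- is.na(x)
  rl <- rle(x1)
  indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
  indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
  indxn <- unlist(lapply(indx1, function(y) {
    indx2 <- which(!is.na(x))
    tail(indx2[which(indx2 < y)], 1)
  }))
  x[indx1] <- x[indxn]
  x
}

# CUMSUM of SAMPLE1 before this step (years 1883 to 1890)
fill_last_na(c(0, NA, 12, NA, NA, NA, NA, 29))
# CUMSUM of SAMPLE2 contains no NA at this point and is returned unchanged
fill_last_na(c(0, -19, -77, -77, -117))
```

For SAMPLE1, position 7 (1889) receives the 12 from position 3 (1885), while the single NA at position 2 (1884) stays NA, which matches the desired output.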

datN
#      NAME SURVEY_YEAR  ID REFERENCE_YEAR VALUE CUMSUM
#1  SAMPLE1        1883 253           1883     0      0
#2  SAMPLE1        1884 253           1883    NA     NA
#3  SAMPLE1        1885 253           1884    12     12
#4  SAMPLE1        1886 253           1885    NA     NA
#5  SAMPLE1        1887 253           1886    NA     NA
#6  SAMPLE1        1888 253           1887    NA     NA
#7  SAMPLE1        1889 253           1888    NA     12
#8  SAMPLE1        1890 253           1889    17     29
#9  SAMPLE2        1991 261           1991     0      0
#10 SAMPLE2        1992 261           1991   -19    -19
#11 SAMPLE2        1993 261           1992   -58    -77
#12 SAMPLE2        1994 261           1992   -58    -77
#13 SAMPLE2        1995 261           1994   -40   -117
