R: filling up data gaps with NAs and applying cumsum function
Problem description
I was asked to break down my earlier question (R: Applying cumulative sum function and filling data gaps with NA for plotting) and post a smaller sample. Here it is; the sample data can be found at https://dl.dropboxusercontent.com/u/16277659/inputdata.csv
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
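For experimenting without the download, the listing above can be read straight from text. This is a minimal sketch that assumes the file is semicolon-separated with spaces after the separators, exactly as shown; the name `txt` is only illustrative.

```r
# Sample data from the question, embedded as text so the block runs on its own.
txt <- "NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40"

# strip.white = TRUE removes the blanks after each ';' so that "NA" is
# recognised as missing and the numeric columns are parsed as numbers.
dat <- read.table(text = txt, header = TRUE, sep = ";",
                  strip.white = TRUE, stringsAsFactors = FALSE)
str(dat)
```

For the real file, `text = txt` would be replaced by the file path, assuming the file has the same layout.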
I would like to calculate the cumulative sum of the column VALUE and fill the data gaps for the years in between with NA values. The structure of the data should stay the same, as I need the other columns for further processing.
When filling the gaps, NAs should be inserted as in SAMPLE1. Please note where the values go when several NAs in a row are filled into the column CUMSUM: the last CUMSUM value should be repeated next to the last NA in VALUE (this is needed for plotting).
The exception is when the period between REFERENCE_YEAR and SURVEY_YEAR is longer than one year: in that case the value should be written into the filled rows as well, as in SAMPLE2 for the period 1992 to 1994.
This is only a sample dataset; my actual dataset has several columns and about 40000 rows. A base R solution would be best. REFERENCE_YEAR and SURVEY_YEAR are equal in the first row of each SAMPLE because of the code I use to write a zero for each group. The desired output is:
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE; CUMSUM
SAMPLE1; 253; 1883; 1883; 0; 0
SAMPLE1; 253; 1884; 1883; NA; NA
SAMPLE1; 253; 1885; 1884; 12; 12
SAMPLE1; 253; 1886; 1885; NA; NA
SAMPLE1; 253; 1887; 1886; NA; NA
SAMPLE1; 253; 1888; 1887; NA; NA
SAMPLE1; 253; 1889; 1888; NA; 12
SAMPLE1; 253; 1890; 1889; 17; 29
SAMPLE2; 261; 1991; 1991; 0; 0
SAMPLE2; 261; 1992; 1991; -19; -19
SAMPLE2; 261; 1993; 1992; -58; -77
SAMPLE2; 261; 1994; 1992; -58; -77
SAMPLE2; 261; 1995; 1994; -40; -117
--------------------------------------------------------------------------------------------
If dat is the dataset, one way would be:
Create a new dataset by expanding between the minimum and maximum SURVEY_YEAR for each NAME:
dat1 <- setNames(
  stack(with(dat, tapply(SURVEY_YEAR, NAME,
                         FUN = function(x) seq(min(x), max(x)))))[2:1],
  c("NAME", "SURVEY_YEAR"))
Merge the new dataset dat1 with the old dat:
datN <- merge(dat1, dat, all=TRUE)
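The expand-and-merge steps above can be tried on a small example. The sketch below uses hypothetical toy data and builds the expanded table with `split` and `lapply` rather than `stack` and `tapply`; this is a swapped-in variant meant to produce the same kind of table, not the code from the answer.

```r
# Toy data (hypothetical): group A has a gap in 2001-2002.
dat <- data.frame(NAME = c("A", "A", "B", "B"),
                  SURVEY_YEAR = c(2000, 2003, 2010, 2011),
                  VALUE = c(0, 5, 0, -3),
                  stringsAsFactors = FALSE)

# One full year sequence per group.
years <- lapply(split(dat$SURVEY_YEAR, dat$NAME),
                function(x) seq(min(x), max(x)))

# Long format: one row per NAME and year.
dat1 <- data.frame(NAME = rep(names(years), lengths(years)),
                   SURVEY_YEAR = unlist(years, use.names = FALSE),
                   stringsAsFactors = FALSE)

# all = TRUE keeps the years that exist only in dat1; their VALUE becomes NA.
datN <- merge(dat1, dat, all = TRUE)
datN
```

The rows for 2001 and 2002 in group A come out with `VALUE` set to `NA`, which is the gap filling the question asks for.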
Replace the missing values in REFERENCE_YEAR with the SURVEY_YEAR from the previous row:
datN$REFERENCE_YEAR[is.na(datN$REFERENCE_YEAR)] <- datN$SURVEY_YEAR[which(is.na(datN$REFERENCE_YEAR))-1]
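A short sketch of what this indexing does, using the 1885 to 1890 stretch of SAMPLE1. It relies on two properties of the merged data: gap rows only occur between the first and last year of a group, so the row before a gap row always belongs to the same group, and the years are consecutive. The name `prev_year` is illustrative.

```r
# Stretch of SAMPLE1 after the merge: 1886-1889 are filled-in gap rows.
datN <- data.frame(SURVEY_YEAR    = 1885:1890,
                   REFERENCE_YEAR = c(1884, NA, NA, NA, NA, 1889))

miss <- is.na(datN$REFERENCE_YEAR)

# Same indexing as above: take SURVEY_YEAR from the row before each gap row.
prev_year <- datN$SURVEY_YEAR[which(miss) - 1]

# Because the expanded years are consecutive, this equals SURVEY_YEAR - 1.
datN$REFERENCE_YEAR[miss] <- prev_year
datN
```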
Use na.locf from zoo to fill the NAs in ID, and create an empty CUMSUM column:
library(zoo)
datN$ID <- na.locf(datN$ID)
datN$CUMSUM <- NA
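Since the question asks for base R, the zoo dependency could probably be avoided with a small helper. The function below is a sketch of "last observation carried forward", and `locf` is a hypothetical name. Unlike `na.locf` with its default settings, it keeps leading NAs in place instead of dropping them.

```r
# Carry the last non-NA value forward (base R only).
locf <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the most recent non-NA value
  idx[idx == 0] <- NA        # nothing to carry forward yet: stay NA
  x[!is.na(x)][idx]
}

locf(c(253, NA, NA, 261, NA))
# 253 253 253 261 261
```

With the merged data this would be used as `datN$ID <- locf(datN$ID)`, which works here because every group starts with an observed row.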
Compute cumsum on the non-NA VALUE rows within each NAME and store the result in the matching CUMSUM rows:
datN$CUMSUM[!is.na(datN$VALUE)] <- unlist(with(datN, tapply(VALUE, NAME, FUN=function(x) cumsum(x[!is.na(x)]))))
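The line above relies on the rows being sorted by NAME, which `merge` takes care of. A variant with `ave` that does not depend on row order might look like the sketch below; the toy vectors and the name `cs` are illustrative.

```r
# Toy input: two groups, NAs mark years without a measurement.
VALUE <- c(0, NA, 12, NA, 17, 0, -19, -58)
NAME  <- rep(c("SAMPLE1", "SAMPLE2"), c(5, 3))

# Cumulative sum per group that skips NAs and leaves them as NA.
cs <- ave(VALUE, NAME, FUN = function(x) {
  out <- rep(NA_real_, length(x))
  ok  <- !is.na(x)
  out[ok] <- cumsum(x[ok])
  out
})
cs
# 0 NA 12 NA 29 0 -19 -77
```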
Find the rows where the difference between SURVEY_YEAR and REFERENCE_YEAR is greater than 1:
indx <- with(datN, SURVEY_YEAR-REFERENCE_YEAR)>1
Copy the VALUE and CUMSUM values of those rows into the rows directly before them:
datN[,c("VALUE", "CUMSUM")] <- lapply(datN[,c("VALUE", "CUMSUM")], function(x) {x[which(indx)-1] <- x[indx]; x})
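The copy step above reaches back exactly one row, which covers the two-year period in the sample. If longer periods occur in the real data, something like the following might be closer to what is wanted. This is only a sketch: `fill_back` is a hypothetical helper, and it assumes one row per year, that all rows of a period belong to the same NAME, and that every row inside the period should receive the value.

```r
# Copy the value of each row whose period is longer than one year
# into all earlier rows that fall inside that period.
fill_back <- function(x, survey, reference) {
  gap <- survey - reference
  for (i in which(gap > 1)) {
    first <- max(1, i - gap[i] + 1)   # first row inside the period
    x[seq(first, i - 1)] <- x[i]
  }
  x
}

# SAMPLE2 from the question: 1993 is a filled-in row, 1994 refers to 1992.
fill_back(x         = c(0, -19, NA, -58, -40),
          survey    = 1991:1995,
          reference = c(1991, 1991, 1992, 1992, 1994))
# 0 -19 -58 -58 -40
```

For a two-year period this gives the same result as the one-row copy. It would be applied to VALUE and CUMSUM alike.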
Within each NAME, change selected NA values in CUMSUM (the NA directly before a non-NA value, as in row 1889 of SAMPLE1) to the previous non-NA value:
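datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = function(x) {
    x1 <- is.na(x)
    rl <- rle(x1)
    # positions just before each non-NA value that comes after the first NA
    indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
    # keep positions that lie more than one row after the previous candidate
    indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
    # for each kept position, the index of the closest earlier non-NA value
    indxn <- unlist(lapply(indx1, function(y) {
        indx2 <- which(!is.na(x))
        tail(indx2[which(indx2 < y)], 1)
    }))
    x[indx1] <- x[indxn]
    x
}))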
datN
# NAME SURVEY_YEAR ID REFERENCE_YEAR VALUE CUMSUM
#1 SAMPLE1 1883 253 1883 0 0
#2 SAMPLE1 1884 253 1883 NA NA
#3 SAMPLE1 1885 253 1884 12 12
#4 SAMPLE1 1886 253 1885 NA NA
#5 SAMPLE1 1887 253 1886 NA NA
#6 SAMPLE1 1888 253 1887 NA NA
#7 SAMPLE1 1889 253 1888 NA 12
#8 SAMPLE1 1890 253 1889 17 29
#9 SAMPLE2 1991 261 1991 0 0
#10 SAMPLE2 1992 261 1991 -19 -19
#11 SAMPLE2 1993 261 1992 -58 -77
#12 SAMPLE2 1994 261 1992 -58 -77
#13 SAMPLE2 1995 261 1994 -40 -117
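The last step is the hardest one to follow, so here is a more explicit sketch based on `rle`. It implements one reading of the requirement: only NA runs longer than one row that sit between two non-NA values get their last element filled. `fill_last_na` is a hypothetical helper, and it may behave differently from the code above on edge cases that the sample does not cover.

```r
# Fill the last NA of each interior NA run (length > 1)
# with the non-NA value just before the run.
fill_last_na <- function(x) {
  r      <- rle(is.na(x))
  ends   <- cumsum(r$lengths)        # last index of each run
  starts <- ends - r$lengths + 1     # first index of each run
  pick   <- r$values & r$lengths > 1 & starts > 1 & ends < length(x)
  x[ends[pick]] <- x[starts[pick] - 1]
  x
}

# CUMSUM of SAMPLE1 before this step
fill_last_na(c(0, NA, 12, NA, NA, NA, NA, 29))
# 0 NA 12 NA NA NA 12 29
```

Per group it would be applied as `datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = fill_last_na))`.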