如何用 na.spline 替换 data.table 中的 NA 值 [英] How to replace NA values in a data.table with na.spline
问题描述
我正在尝试准备一些从 Eurostat 检索到的人口统计数据以供进一步处理,其中包括用相应的近似数据替换任何缺失的数据.
I'm trying to prepare some demographic data retrieved from Eurostat for further processing, amongst others replacing any missing data with corresponding approximated ones.
起初我只使用 data.frames,但后来我确信 data.tables 可能比常规 data.frames 提供一些优势,所以我迁移到 data.tables.
First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.
我在这样做时观察到的一件事是,将na.spline"与apply"结合使用时得到不同的结果,而na.spline"作为数据表的一部分.
One thing I've observed while doing so was getting different results when using "na.spline" in combination with "apply" versus "na.spline" as part of the data.table.
#1 源数据
(dt0 <- data.table(
"age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
"2014" = c(NA, NA, NA, 4351253, NA),
"2013" = c(37408, NA, NA, 4328238, 4707690),
"2012" = c(38252, NA, 1684000, 4309977, 4651601),
"2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
"2010" = c(40296, NA, 1673656, 4285442, 4542083)
))
哪个生成
age,sex,geo\\time 2014 2013 2012 2011 2010
1: TOTAL,F,AD NA 37408 38252 38252 40296
2: TOTAL,F,AL NA NA NA 1409931 NA
3: TOTAL,F,AM NA NA 1684000 1679066 1673656
4: TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
5: TOTAL,F,AZ NA 4707690 4651601 4594023 4542083
拆分为行标题"...
(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))
...和人口数据...
...and population data...
(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0ba <- colnames(dt0b))
#2 将 NA 替换为 "na.spline" &申请"
#2 replace NAs by "na.spline" & "apply"
(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))
结果是……
age,sex,geo\\time 2014 2013 2012 2011 2010
1: TOTAL,F,AD 32832 37408 38252 38252 40296
2: TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
3: TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
4: TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
5: TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083
#3 替换data.table"中的 NA
(dt2b <- dt0b[,lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))
等等……
age,sex,geo\\time 2014 2013 2012 2011 2010
1: TOTAL,F,AD 4351253 37408 38252.0 38252 40296.0
2: TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
3: TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
4: TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
5: TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0
#4 比较结果
(identical(dt1, dt2))
考虑到上述情况并不令人意外......
Considering the aforementioned not quite a surprise...
[1] FALSE
(用方法 #2 代替 NA 计算的值是我感兴趣的值,仅通过方法 #3 生成).
(The values calculated as a replacement for NAs with approach #2 are the ones I'd be interested in, only generated via approach #3).
追求data.table"路线(方法#3)的原因之一是性能(正如在各种帖子中指出的那样,当使用apply"时,正在执行矩阵运算,这需要与仅促进data.table"的相应方法相比,时间要长得多).
Reason for pursuing the "data.table" route (approach #3) is one of performance (as it has been pointed out in various posts, when using "apply" a matrix operation is getting carried out, which is taking a considerably longer time than a corresponding approach facilitating "data.table" only).
作为 R 的新手,我认为我做了一些根本性的错误,唯一的问题是,我对这可能是什么一无所知.
Being pretty new to R I reckon I've done something quite fundamentally wrong, only thing is, I haven't got the faintest idea what this might be.
为我指明正确方向的任何帮助都非常感谢!
Any help pointing me in the right direction is more than appreciated!
-Sil68
推荐答案
使用矩阵.在 matrix
上使用矩阵运算并不慢:
Using a matrix. Using a matrix operation on a matrix
is not slow:
mat <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
mat[] <- na.spline(mat,na.rm=FALSE)
给出
TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
2014 32832 1409931 1692440 4351253 4755163
2013 37408 1409931 1688458 4328238 4707690
2012 38252 1409931 1684000 4309977 4651601
2011 38252 1409931 1679066 4296293 4594023
2010 40296 1409931 1673656 4285442 4542083
<小时>
使用 data.table.如果您想使用 data.table
,请执行
Using a data.table. If you instead want to use a data.table
, do
mat <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT <- data.table(mat,keep.rownames=TRUE)
DT[,(vn):=lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn]
通过引用更新DT
,给出
rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
1: 2014 32832 1409931 1692440 4351253 4755163
2: 2013 37408 1409931 1688458 4328238 4707690
3: 2012 38252 1409931 1684000 4309977 4651601
4: 2011 38252 1409931 1679066 4296293 4594023
5: 2010 40296 1409931 1673656 4285442 4542083
<小时>
基准测试:
mat <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT <- data.table(mat,keep.rownames=TRUE)
vn <- names(DT)[-1]
tvn <- names(dt0)[-1]
require(microbenchmark)
microbenchmark(
transp = dt0[,as.list(na.spline(unlist(.SD), na.rm=FALSE)),by=1:nrow(dt0),.SDcols=tvn],
lapply = DT[,lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn],
apply = apply(mat,2,na.spline,na.rm=FALSE),
fun = na.spline(mat,na.rm=FALSE),
times=10)
结果:
Unit: milliseconds
expr min lq mean median uq max neval
transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957 10
lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446 10
apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801 10
fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846 10
transp"结果显示了@shadow 解决方案的时间,它保留了 OP 的格式.由于 na.spline
的工作原理,这里不需要 apply
.
The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply
is not necessary here, thanks to how na.spline
works.
这篇关于如何用 na.spline 替换 data.table 中的 NA 值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!