如何用 na.spline 替换 data.table 中的 NA 值 [英] How to replace NA values in a data.table with na.spline

查看:26
本文介绍了如何用 na.spline 替换 data.table 中的 NA 值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试准备一些从 Eurostat 检索到的人口统计数据以供进一步处理,其中包括用相应的近似数据替换任何缺失的数据.

I'm trying to prepare some demographic data retrieved from Eurostat for further processing, amongst others replacing any missing data with corresponding approximated ones.

起初我只使用 data.frames,但后来我确信 data.tables 可能比常规 data.frames 提供一些优势,所以我迁移到 data.tables.

First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.

我在这样做时观察到的一件事是,将na.spline"与apply"结合使用时得到不同的结果,而na.spline"作为数据表的一部分.

One thing I've observed while doing so was getting different results when using "na.spline" in combination with "apply" versus "na.spline" as part of the data.table.

#1 源数据

(dt0 <- data.table(
            "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
            "2014" = c(NA,    NA,      NA,      4351253, NA),
            "2013" = c(37408, NA,      NA,      4328238, 4707690),
            "2012" = c(38252, NA,      1684000, 4309977, 4651601),
            "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
            "2010" = c(40296, NA,      1673656, 4285442, 4542083)
        ))

哪个生成

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD      NA   37408   38252   38252   40296
    2:        TOTAL,F,AL      NA      NA      NA 1409931      NA
    3:        TOTAL,F,AM      NA      NA 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ      NA 4707690 4651601 4594023 4542083

拆分为行标题"...

(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))

...和人口数据...

...and population data...

(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0ba <- colnames(dt0b))

#2 将 NA 替换为 "na.spline" &申请"

#2 replace NAs by "na.spline" & "apply"

(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))

结果是……

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD   32832   37408   38252   38252   40296
    2:        TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
    3:        TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083

#3 替换data.table"中的 NA

(dt2b <- dt0b[,lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))

等等……

       age,sex,geo\\time    2014    2013      2012    2011      2010
    1:        TOTAL,F,AD 4351253   37408   38252.0   38252   40296.0
    2:        TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
    3:        TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
    4:        TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
    5:        TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0

#4 比较结果

(identical(dt1, dt2))

考虑到上述情况并不令人意外......

Considering the aforementioned not quite a surprise...

    [1] FALSE

(用方法 #2 代替 NA 计算的值是我感兴趣的值,仅通过方法 #3 生成).

(The values calculated as a replacement for NAs with approach #2 are the ones I'd be interested in, only generated via approach #3).

追求data.table"路线(方法#3)的原因之一是性能(正如在各种帖子中指出的那样,当使用apply"时,正在执行矩阵运算,这需要与仅促进data.table"的相应方法相比,时间要长得多).

Reason for pursuing the "data.table" route (approach #3) is one of performance (as it has been pointed out in various posts, when using "apply" a matrix operation is getting carried out, which is taking a considerably longer time than a corresponding approach facilitating "data.table" only).

作为 R 的新手,我认为我做了一些根本性的错误,唯一的问题是,我对这可能是什么一无所知.

Being pretty new to R I reckon I've done something quite fundamentally wrong, only thing is, I haven't got the faintest idea what this might be.

为我指明正确方向的任何帮助都非常感谢!

Any help pointing me in the right direction is more than appreciated!

-Sil68

推荐答案

使用矩阵.在 matrix 上使用矩阵运算并不慢:

Using a matrix. Using a matrix operation on a matrix is not slow:

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
mat[]         <- na.spline(mat,na.rm=FALSE)

给出

     TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
2014      32832    1409931    1692440    4351253    4755163
2013      37408    1409931    1688458    4328238    4707690
2012      38252    1409931    1684000    4309977    4651601
2011      38252    1409931    1679066    4296293    4594023
2010      40296    1409931    1673656    4285442    4542083

<小时>

使用 data.table.如果您想使用 data.table,请执行


Using a data.table. If you instead want to use a data.table, do

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT            <- data.table(mat,keep.rownames=TRUE)
DT[,(vn):=lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn]

通过引用更新DT,给出

     rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
1: 2014      32832    1409931    1692440    4351253    4755163
2: 2013      37408    1409931    1688458    4328238    4707690
3: 2012      38252    1409931    1684000    4309977    4651601
4: 2011      38252    1409931    1679066    4296293    4594023
5: 2010      40296    1409931    1673656    4285442    4542083

<小时>

基准测试:

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT            <- data.table(mat,keep.rownames=TRUE)
vn            <- names(DT)[-1]
tvn           <- names(dt0)[-1]

require(microbenchmark)
microbenchmark(
  transp = dt0[,as.list(na.spline(unlist(.SD), na.rm=FALSE)),by=1:nrow(dt0),.SDcols=tvn],
  lapply = DT[,lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn],
  apply  = apply(mat,2,na.spline,na.rm=FALSE),
  fun    = na.spline(mat,na.rm=FALSE),
times=10)

结果:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957    10
 lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446    10
  apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801    10
    fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846    10

transp"结果显示了@shadow 解决方案的时间,它保留了 OP 的格式.由于 na.spline 的工作原理,这里不需要 apply.

The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply is not necessary here, thanks to how na.spline works.

这篇关于如何用 na.spline 替换 data.table 中的 NA 值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆