How to replace NA values in a data.table with na.spline

Problem description

I'm trying to prepare some demographic data retrieved from Eurostat for further processing, which includes replacing any missing values with approximated ones.

First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.

One thing I've observed while doing so was getting different results when using "na.spline" in combination with "apply" versus "na.spline" as part of the data.table.

#1 source data

library(data.table)
library(zoo)  # provides na.spline

(dt0 <- data.table(
            "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
            "2014" = c(NA,    NA,      NA,      4351253, NA),
            "2013" = c(37408, NA,      NA,      4328238, 4707690),
            "2012" = c(38252, NA,      1684000, 4309977, 4651601),
            "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
            "2010" = c(40296, NA,      1673656, 4285442, 4542083)
        ))

Which generates

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD      NA   37408   38252   38252   40296
    2:        TOTAL,F,AL      NA      NA      NA 1409931      NA
    3:        TOTAL,F,AM      NA      NA 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ      NA 4707690 4651601 4594023 4542083

Split into "row headings"...

(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))

...and population data...

(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0b <- colnames(dt0b))

#2 replace NAs by "na.spline" & "apply"

(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))

Which results in...

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD   32832   37408   38252   38252   40296
    2:        TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
    3:        TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083

#3 replace NAs within "data.table"

(dt2b <- dt0b[,lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))

Et voila...

       age,sex,geo\\time    2014    2013      2012    2011      2010
    1:        TOTAL,F,AD 4351253   37408   38252.0   38252   40296.0
    2:        TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
    3:        TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
    4:        TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
    5:        TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0

#4 compare results

(identical(dt1, dt2))

Given the above, not much of a surprise...

    [1] FALSE

(The replacement values from approach #2 are the ones I'm after, but I'd like to produce them via approach #3.)

The reason for pursuing the "data.table" route (approach #3) is performance: as has been pointed out in various posts, "apply" carries out a matrix operation, which takes considerably longer than a corresponding approach using "data.table" only.

Being pretty new to R, I reckon I've done something quite fundamentally wrong, but I haven't got the faintest idea what it might be.

Any help pointing me in the right direction is more than appreciated!

-Sil68

Solution

Using a matrix. Using a matrix operation on a matrix is not slow:

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
mat[]         <- na.spline(mat,na.rm=FALSE)

which gives

     TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
2014      32832    1409931    1692440    4351253    4755163
2013      37408    1409931    1688458    4328238    4707690
2012      38252    1409931    1684000    4309977    4651601
2011      38252    1409931    1679066    4296293    4594023
2010      40296    1409931    1673656    4285442    4542083
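
As far as I can tell, approach #3 gave different numbers because of direction: lapply(.SD, ...) visits columns, so in the original layout each spline is fitted down a year column (across countries) instead of along one country's series. Here is a minimal sketch of that difference. The matrix `m` and its values are made up for illustration and are not from the question.

```r
library(zoo)

# Hypothetical toy data: rows are series, columns are time points.
m <- rbind(a = c(NA, 2, 4, 6, 8),
           b = c(10, NA, 30, 40, 50))

# Row-wise, as in approach #2: each row is interpolated as its own series.
by_row <- t(apply(m, 1, na.spline, na.rm = FALSE))

# Column-wise: what na.spline() does when handed a matrix directly,
# and the same direction lapply(.SD, ...) takes on a data.table.
by_col <- m
by_col[] <- na.spline(m, na.rm = FALSE)

# Transposing first makes the column-wise call match the row-wise result.
tm <- t(m)
tm[] <- na.spline(tm, na.rm = FALSE)
via_t <- t(tm)

isTRUE(all.equal(by_row, via_t, check.attributes = FALSE))
```

That is why the transpose comes first in the code above: once the years are in rows, the column-wise behaviour is exactly what is wanted.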


Using a data.table. If you instead want to use a data.table, do

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT            <- data.table(mat,keep.rownames=TRUE)
vn            <- names(DT)[-1]   # columns to interpolate (everything except "rn")
DT[,(vn):=lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn]

which updates DT by reference, giving

     rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
1: 2014      32832    1409931    1692440    4351253    4755163
2: 2013      37408    1409931    1688458    4328238    4707690
3: 2012      38252    1409931    1684000    4309977    4651601
4: 2011      38252    1409931    1679066    4296293    4594023
5: 2010      40296    1409931    1673656    4285442    4542083
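
If the result is needed back in the question's layout (one row per series, one column per year), one option is to transpose the filled matrix again and reattach the identifier column. This is only a sketch: it uses three rows from the question's data, and the column name `id` is a shortened stand-in I chose for the original "age,sex,geo\\time" header.

```r
library(data.table)
library(zoo)

# Three rows taken from the question's dt0; "id" is a hypothetical, shortened name.
dt0 <- data.table(
  id     = c("TOTAL,F,AD", "TOTAL,F,AT", "TOTAL,F,AZ"),
  "2014" = c(NA,    4351253, NA),
  "2013" = c(37408, 4328238, 4707690),
  "2012" = c(38252, 4309977, 4651601),
  "2011" = c(38252, 4296293, 4594023),
  "2010" = c(40296, 4285442, 4542083)
)

# Years in rows, one column per series, so na.spline() runs along each series.
mat <- t(as.matrix(dt0[, -1]))
colnames(mat) <- dt0[[1]]
mat[] <- na.spline(mat, na.rm = FALSE)

# Transpose back to one row per series and reattach the identifier column.
res <- data.table(id = colnames(mat), t(mat))
res
```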


Benchmarking:

mat           <- t(as.matrix(dt0[,-1]))
colnames(mat) <- dt0[[1]]
DT            <- data.table(mat,keep.rownames=TRUE)
vn            <- names(DT)[-1]
tvn           <- names(dt0)[-1]

require(microbenchmark)
microbenchmark(
  transp = dt0[,as.list(na.spline(unlist(.SD), na.rm=FALSE)),by=1:nrow(dt0),.SDcols=tvn],
  lapply = DT[,lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn],
  apply  = apply(mat,2,na.spline,na.rm=FALSE),
  fun    = na.spline(mat,na.rm=FALSE),
times=10)

Results:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957    10
 lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446    10
  apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801    10
    fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846    10

The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply is not necessary here because na.spline, when given a matrix, already interpolates each column separately.
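
Not benchmarked above, but for completeness: another way to stay inside data.table is to reshape to long format, interpolate per group, and reshape back. This is a sketch of that alternative and was not part of the timings. It uses three rows from the question's data, and the names `id`, `year`, and `pop` are placeholders I picked.

```r
library(data.table)
library(zoo)

# Three rows taken from the question's dt0; "id" is a hypothetical, shortened name.
dt0 <- data.table(
  id     = c("TOTAL,F,AD", "TOTAL,F,AT", "TOTAL,F,AZ"),
  "2014" = c(NA,    4351253, NA),
  "2013" = c(37408, 4328238, 4707690),
  "2012" = c(38252, 4309977, 4651601),
  "2011" = c(38252, 4296293, 4594023),
  "2010" = c(40296, 4285442, 4542083)
)

# Long format: one row per (series, year) pair.
long <- melt(dt0, id.vars = "id", variable.name = "year", value.name = "pop")

# Interpolate within each series, updating the column by reference.
long[, pop := na.spline(pop, na.rm = FALSE), by = id]

# Back to one row per series and one column per year.
wide <- dcast(long, id ~ year, value.var = "pop")
wide
```

Whether this beats the matrix route on real data would need its own benchmark; with a table this small the differences are likely dominated by overhead.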
