Apply over rows of data.table: find rows where a subset of columns are all NA

Question

I am trying, in my quest to rewrite old (slow) code with the data.table package, to figure out the best way to use apply with a data.table.

I have a data.table with multiple id columns, then multiple columns that have dose-response data in a wide format. I need to generalize the answer because not all data.tables will have the same number of dose-response columns. For simplicity I think the following data.table addresses the issue:

library(data.table)
library(microbenchmark)
set.seed(1234)
DT1 =  data.table(unique_id = paste0('id',1:1e6),
                 dose1 = sample(c(1:9,NA),1e6,replace=TRUE),
                 dose2 = sample(c(1:9,NA),1e6,replace=TRUE)
                 )

> DT1
          unique_id dose1 dose2
       1:       id1     2     2
       2:       id2     7     4
       3:       id3     7     9
       4:       id4     7     4
       5:       id5     9     3
---                      
  999996:  id999996     4     3
  999997:  id999997    NA     3
  999998:  id999998     4     2
  999999:  id999999     8     5
 1000000: id1000000     6     7

So each row has a unique id, some other ids, and I have left out the response columns, because they will be NA where the dose columns are NA. What I need to do is remove rows where all of the dose columns are NA. I came up with the first option, then realized I could trim it down to the second option.

DT2 <- copy(DT1)
DT3 <- copy(DT1)

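# Count the non-NA values in a vector
len.not.na <- function(x){length(which(!is.na(x)))}

# Option 1: store a per-row count of non-NA doses in a helper column (via .SD),
# filter on it, then drop the helper column from the filtered copy
option1 <- function(DT){
  DT[,flag := apply(.SD,1,len.not.na),.SDcols=grep("dose",colnames(DT))]
  DT <- DT[flag != 0]
  DT[ , flag := NULL ]
}

# Option 2: the same row-wise count, used directly as the row filter
option2 <- function(DT){
  DT[ apply(DT[,grep("dose",colnames(DT)),with=FALSE],1,len.not.na) != 0 ]
}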

> microbenchmark(op1 <- option1(DT2), op2 <- option2(DT3),times=25L)
Unit: seconds
                expr      min       lq   median       uq      max neval
 op1 <- option1(DT2) 8.364504 8.863436 9.145341 11.27827 11.50356    25
 op2 <- option2(DT3) 8.291549 8.774746 8.982536 11.15269 11.72199    25

Clearly the two options do about the same thing, with option 1 having a few more steps, but I wanted to test how calling .SD might slow things down, as has been suggested by other posts.

Either way, both options are still on the slow side. Any suggestions for speeding things up?

Edit, using @AnandaMahto's comment:

DT4 <- copy(DT1)
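option3 <- function(DT){
  # Caveat: rowSums() on the raw doses is NA for any row containing an NA, and
  # data.table treats NA in a logical i as FALSE, so rows with ANY missing dose
  # are dropped. rowSums(!is.na(...)) != 0 would match option2.
  DT[rowSums(DT[,grep("dose",colnames(DT)),with=FALSE]) != 0]
}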

> microbenchmark(op2 <- option2(DT3), op3 <- option3(DT4),times=5L)
Unit: milliseconds
               expr        min         lq    median        uq       max neval
op2 <- option2(DT3) 7738.21094 7810.87777 7838.6067 7969.5543 8407.4069     5
op3 <- option3(DT4)   83.78921   92.65472  320.6273  559.8153  783.0742     5

rowSums is definitely faster. I am happy with the solution unless anyone has something faster.
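
One thing worth checking with the rowSums variant is what it does to rows that are only partially missing. The sketch below uses a small, hypothetical toy table (the names `toy`, `dose_cols`, `raw_sum` and `non_na` are illustrative, not from the original post) to compare summing the raw dose values with counting the non-NA entries. It is a minimal illustration under those assumptions, not a benchmark.

```r
library(data.table)

# Hypothetical toy data: id2 and id4 have one missing dose, id3 has both missing.
toy <- data.table(unique_id = paste0("id", 1:4),
                  dose1 = c(1L, NA, NA, 4L),
                  dose2 = c(2L, 5L, NA, NA))
dose_cols <- grep("dose", names(toy), value = TRUE)

# Summing the raw values: a single NA makes the whole row sum NA,
# and data.table treats NA in a logical i as FALSE, so the row is dropped.
raw_sum <- toy[rowSums(toy[, ..dose_cols]) != 0]

# Counting non-missing values: only rows where every dose is NA give 0.
non_na <- toy[rowSums(!is.na(toy[, ..dose_cols])) != 0]

print(raw_sum$unique_id)
print(non_na$unique_id)
```

On this toy table the first filter keeps only the fully observed row, while the second keeps every row that has at least one dose, which is the behaviour the question asks for.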

Recommended answer

My approach is as follows:

Use rowSums to find the rows you want to keep:

Dose <- grep("dose", colnames(DT1))
# The .. prefix means "look one level up": Dose is taken from the calling scope, not treated as a column name
Flag <- rowSums(is.na(DT1[, ..Dose])) != length(Dose)
DT1[Flag]
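
Since the question asks for something that works whatever the number of dose columns, the same idea can be wrapped in a small function. The following is a sketch only: `drop_all_na_rows`, its `pattern` argument, and the toy tables are hypothetical names introduced here for illustration, and the pattern is assumed to match only the dose columns.

```r
library(data.table)

# Hypothetical helper: drop rows in which every column matching `pattern` is NA.
drop_all_na_rows <- function(DT, pattern = "dose") {
  cols <- grep(pattern, names(DT), value = TRUE)
  if (length(cols) == 0L) return(DT)         # no matching columns, nothing to drop
  n_missing <- rowSums(is.na(DT[, ..cols]))  # missing doses per row
  DT[n_missing < length(cols)]               # keep rows with at least one dose
}

# Toy tables (made up) with two and three dose columns.
toy2 <- data.table(unique_id = paste0("id", 1:4),
                   dose1 = c(1L, NA, NA, 4L),
                   dose2 = c(2L, 5L, NA, NA))
toy3 <- data.table(unique_id = paste0("id", 1:3),
                   dose1 = c(NA, NA, 3L),
                   dose2 = c(NA, NA, NA),
                   dose3 = c(NA, 7L, NA))

kept2 <- drop_all_na_rows(toy2)
kept3 <- drop_all_na_rows(toy3)
print(kept2)
print(kept3)
```

The function returns a filtered copy and leaves its input untouched, so it can be applied to DT1 directly without first calling copy().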
