Apply over rows of data.table: find rows where a subset of columns are all NA
Question
I am trying, in my quest to rewrite old (slow) code with the data.table package, to figure out the best way to use apply with a data.table.
I have a data.table with multiple id columns, then multiple columns that have dose-response data in a wide format. I need to generalize the answer because not all data.tables will have the same number of dose-response columns. For simplicity I think the following data.table addresses the issue:
library(data.table)
library(microbenchmark)
set.seed(1234)
DT1 = data.table(unique_id = paste0('id',1:1e6),
dose1 = sample(c(1:9,NA),1e6,replace=TRUE),
dose2 = sample(c(1:9,NA),1e6,replace=TRUE)
)
> DT1
unique_id dose1 dose2
1: id1 2 2
2: id2 7 4
3: id3 7 9
4: id4 7 4
5: id5 9 3
---
999996: id999996 4 3
999997: id999997 NA 3
999998: id999998 4 2
999999: id999999 8 5
1000000: id1000000 6 7
So each row has a unique id and some other ids, and I have left out the response columns, because they will be NA where the dose columns are NA. What I need to do is remove rows where all of the dose columns are NA. I came up with the first option, then realized I could trim it down to the second option.
DT2 <- copy(DT1)
DT3 <- copy(DT1)
len.not.na <- function(x){length(which(!is.na(x)))}
option1 <- function(DT){
DT[,flag := apply(.SD,1,len.not.na),.SDcols=grep("dose",colnames(DT))]
DT <- DT[flag != 0]
DT[ , flag := NULL ]
}
option2 <- function(DT){
DT[ apply(DT[,grep("dose",colnames(DT)),with=FALSE],1,len.not.na) != 0 ]
}
> microbenchmark(op1 <- option1(DT2), op2 <- option2(DT3),times=25L)
Unit: seconds
expr min lq median uq max neval
op1 <- option1(DT2) 8.364504 8.863436 9.145341 11.27827 11.50356 25
op2 <- option2(DT3) 8.291549 8.774746 8.982536 11.15269 11.72199 25
Clearly the two options do about the same thing, with option 1 having a few more steps, but I wanted to test how calling .SD might slow things down, as has been suggested by other posts.
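As a quick sanity check on the claim that the two options select the same rows, here is a minimal sketch on a four-row table. The table `small` and its values are made up for illustration; `len.not.na`, `option1` and `option2` are copied from the question.

```r
library(data.table)

# Helper and both options, copied from the question.
len.not.na <- function(x){length(which(!is.na(x)))}
option1 <- function(DT){
  DT[, flag := apply(.SD, 1, len.not.na), .SDcols = grep("dose", colnames(DT))]
  DT <- DT[flag != 0]
  DT[, flag := NULL]
}
option2 <- function(DT){
  DT[apply(DT[, grep("dose", colnames(DT)), with = FALSE], 1, len.not.na) != 0]
}

# Hypothetical toy data: only id2 has NA in every dose column.
small <- data.table(unique_id = paste0("id", 1:4),
                    dose1 = c(1, NA, NA, 4),
                    dose2 = c(2, NA, 3, NA))

in1 <- copy(small)
in2 <- copy(small)
op1 <- option1(in1)
op2 <- option2(in2)

# Both options keep the same rows.
same_rows <- identical(op1$unique_id, op2$unique_id)

# Column names of each input after the call, to see what `:=` did to it.
names_in1 <- names(in1)
names_in2 <- names(in2)
```

Because `:=` assigns by reference, `option1` adds the `flag` column to the table it is given, and the final `flag := NULL` only cleans up the filtered result. So `names_in1` still contains `flag`, while `names_in2` is unchanged. That is worth keeping in mind when the same input is reused across benchmark runs.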
Either way both options are still on the slow side. Any suggestions for speeding things up?
Edit, from a comment by @AnandaMahto:
DT4 <- copy(DT1)
option3 <- function(DT){
DT[rowSums(DT[,grep("dose",colnames(DT)),with=FALSE]) != 0]
}
> microbenchmark(op2 <- option2(DT3), op3 <- option3(DT4),times=5L)
Unit: milliseconds
expr min lq median uq max neval
op2 <- option2(DT3) 7738.21094 7810.87777 7838.6067 7969.5543 8407.4069 5
op3 <- option3(DT4) 83.78921 92.65472 320.6273 559.8153 783.0742 5
rowSums is definitely faster. I am happy with the solution unless anyone has something faster.
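One thing worth checking before relying on option3: `rowSums` without `na.rm = TRUE` returns NA for any row that contains an NA, and data.table treats NA in a logical `i` as FALSE. The sketch below compares option3 as written with a variant that counts NAs per row via `is.na`. The toy table `small` and the name `option3_isna` are made up for illustration and are not part of the original post.

```r
library(data.table)

# Hypothetical toy data: id2 is all NA, id3 and id4 are partly NA.
small <- data.table(unique_id = paste0("id", 1:4),
                    dose1 = c(1, NA, NA, 4),
                    dose2 = c(2, NA, 3, NA))

# option3 as written in the question.
option3 <- function(DT){
  DT[rowSums(DT[, grep("dose", colnames(DT)), with = FALSE]) != 0]
}

# Illustrative variant: count NAs per row, keep rows that are not all NA.
option3_isna <- function(DT){
  dose <- grep("dose", colnames(DT))
  DT[rowSums(is.na(DT[, dose, with = FALSE])) != length(dose)]
}

kept_sum  <- option3(small)$unique_id
kept_isna <- option3_isna(small)$unique_id
```

On this toy table, option3 keeps only `id1`, the single row with no NA at all, whereas the `is.na` variant keeps `id1`, `id3` and `id4` and drops only the row where every dose is NA. If the goal is to remove rows only when all dose columns are NA, the second form matches that goal.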
Recommended answer
My approach is as follows:
Use rowSums to find the rows you want to keep:
Dose <- grep("dose", colnames(DT1))
Flag <- rowSums(is.na(DT1[, Dose, with = FALSE])) != length(Dose)
DT1[Flag]
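Since the question asks for something that generalizes to any number of dose columns, the three lines above can be wrapped in a small function. This is only a sketch: the helper name `drop_all_na_rows`, its `pattern` argument, and the three-dose table `wide` are hypothetical and not part of the original answer.

```r
library(data.table)

# Hypothetical helper wrapping the answer's three lines.
drop_all_na_rows <- function(DT, pattern = "dose"){
  cols <- grep(pattern, colnames(DT))                               # positions of the dose columns
  keep <- rowSums(is.na(DT[, cols, with = FALSE])) != length(cols)  # FALSE when every dose is NA
  DT[keep]
}

# Toy table with three dose columns; only id1 is NA in all of them.
wide <- data.table(unique_id = paste0("id", 1:3),
                   dose1 = c(NA, 5, NA),
                   dose2 = c(NA, NA, NA),
                   dose3 = c(NA, NA, 7))

res <- drop_all_na_rows(wide)
```

Here `res` holds the rows for `id2` and `id3`. The input table is not modified, because the function subsets with a logical vector and does not use `:=`.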