Selecting rows or columns with data.table R?
Problem description
Imagine I have a data.table, for example:
library(data.table)
RRR <- data.table(1:15, runif(15), rgeom(15, 0.5), rbinom(15, 2, 0.5))
    V1         V2 V3 V4
 1:  1 0.33577273  0  0
 2:  2 0.66739739  2  1
 3:  3 0.07501655  0  0
 4:  4 0.43195663  2  1
 5:  5 0.39525841  3  2
 6:  6 0.15189738  1  1
 7:  7 0.02637279  0  1
 8:  8 0.44165623  0  1
 9:  9 0.98710570  2  0
10: 10 0.62402805  1  0
11: 11 0.84829465  3  2
12: 12 0.02170976  0  1
13: 13 0.74608925  0  2
14: 14 0.29102296  2  0
15: 15 0.83820646  1  1
How can I get a data.table from it containing all the ROWS that have a 0 (or some other value) in any column?
If I had to do it with a single column I could use:
RRR[V4==0,]
   V1         V2 V3 V4
1:  1 0.33577273  0  0
2:  3 0.07501655  0  0
3:  9 0.98710570  2  0
4: 10 0.62402805  1  0
5: 14 0.29102296  2  0
But what if I want to do it with all the columns at once because I have many?
This doesn't do what I need.
RRR[,sapply(RRR,function(xx)(xx==0)), with=TRUE]
         V1    V2    V3    V4
 [1,] FALSE FALSE  TRUE  TRUE
 [2,] FALSE FALSE FALSE FALSE
 [3,] FALSE FALSE  TRUE  TRUE
 [4,] FALSE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE
 [6,] FALSE FALSE FALSE FALSE
 [7,] FALSE FALSE  TRUE FALSE
 [8,] FALSE FALSE  TRUE FALSE
 [9,] FALSE FALSE FALSE  TRUE
[10,] FALSE FALSE FALSE  TRUE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE  TRUE FALSE
[13,] FALSE FALSE  TRUE FALSE
[14,] FALSE FALSE FALSE  TRUE
[15,] FALSE FALSE FALSE FALSE
Maybe with a for loop and some complicated paste? Still, I would prefer to use simple data.table syntax.
Similarly, how would you get a data.table with all the COLUMNS that contain a '0' at any row?
I know how to get the columns that, as a whole, fulfill a condition, such as being numeric:
RRR[,sapply(RRR,function(xx)is.numeric(xx)),with=FALSE]
but this method doesn't work if I want to test the condition elementwise.
In case anybody is interested, these are the system.time() results for a bigger random data.table with the different solutions you have provided so far, with slight modifications.
set.seed(1)
n <- 1000000
RRR <- data.table(matrix(rgeom(100*n,0.5), ncol=100))
Getting ROWS
> RRR[RRR[,rowSums(RRR==0)>0]]
   user  system elapsed
   2.72    0.55    3.27
> RRR[rowSums(RRR==0)>0]
   user  system elapsed
   2.58    0.70    3.28
> RRR[apply(RRR,MAR=1,function(xx)any(xx==0))]
   user  system elapsed
  10.81    0.19   11.00
> RRR[apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=1)]
   user  system elapsed
  10.49    0.30   10.83
Getting COLUMNS
> RRR[,sapply(RRR,function(xx)any(xx==0)), with=FALSE]
   user  system elapsed
   0.81    0.31    1.12
> `[.listof`(RRR,colSums(RRR==0)>0)
   user  system elapsed
   2.14    0.27    2.41
> RRR[,colSums(RRR==0)>0, with=FALSE]
   user  system elapsed
   2.26    0.48    2.75
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))] # only in version 1.9.5; seems to be the same solution as the first one
   user  system elapsed
   0.78    0.36    1.14
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]
   user  system elapsed
   0.41    0.25    0.66
> RRR[Reduce('|',lapply(RRR,function(xx)(xx==0)))]
   user  system elapsed
   3.11    0.33    3.44
> RRR[,apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=2),with=FALSE]
   user  system elapsed
   3.48    0.80    4.28
I haven't included yet:
RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))][i==TRUE][,i:=NULL]
It took several minutes, so I stopped it. It also "tags" the rows instead of extracting them, and it is the most complex solution.
I'll wait for faster or simpler solutions, and I'd like to hear your comments and preferences.
sapply was supposed to be slower, but it isn't. The results could change if the data.table contains other kinds of data.
We could speed it up if we could stop the test (==0) as soon as the first occurrence is found within each row or column. But I guess we can't do that without loops, some low-level access, or bitwise operations.
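The early-stopping idea for rows can be sketched with a plain loop over columns that re-tests only the rows not yet matched. This is an illustration on a small made-up table (DT, hit and todo are names chosen here); it assumes there are no NA values and it has not been benchmarked against the solutions above.

```r
library(data.table)

set.seed(1)
# Small illustrative table; the benchmark above uses a much larger one.
DT <- data.table(matrix(rgeom(5 * 20, 0.5), ncol = 5))

hit <- logical(nrow(DT))            # TRUE once a zero has been seen in that row
for (col in names(DT)) {
  todo <- which(!hit)               # rows that are still undecided
  if (length(todo) == 0L) break     # every row already matched: stop early
  hit[todo] <- DT[[col]][todo] == 0 # test only the undecided rows
}
rows_with_zero <- DT[hit]
```

Whether this beats rowSums in practice would depend on how quickly rows get matched, since each pass still pays for the subsetting.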
I've thought of a new method.
a) sapply(RRR,function(xx)which(xx==0))
b) I need to combine the results of a) by taking the union of the lists, but I don't know how to do it for any number of columns.
c) And then get those rows by indexing RRR with the combined result. I guess it's going to be much slower if the number of zeros is big. Maybe
RRR[unique(unlist(sapply(RRR,function(xx)which(xx==0))))]
but it's too slow.
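For step b), one possible way to take the union over any number of columns is base R's Reduce with union. The sketch below uses a small made-up table (DT is not from the question) and lapply instead of sapply, so that the result is always a list; it is not expected to be faster than the unique(unlist(...)) form.

```r
library(data.table)

# Small illustrative table (assumed data, not from the question).
DT <- data.table(a = c(1, 0, 2, 3), b = c(0, 5, 5, 5), c = c(3, 1, 0, 4))

# a) row positions of the zeros, one integer vector per column
idx_per_col <- lapply(DT, function(col) which(col == 0))

# b) union across all columns, sorted to keep the original row order
row_idx <- sort(Reduce(union, idx_per_col))

# c) extract those rows
rows_with_zero <- DT[row_idx]
```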
An option to get the opposite (the rows without any 0) would be "RRR[(RRR==0)] <- NA; na.omit(RRR)".
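A non-destructive way to get the same complement is to keep the rows (or columns) whose count of zeros is exactly 0. The sketch below uses a small made-up table (DT is an assumption, not from the question) and leaves the data unmodified. Unlike the na.omit route, it would not also drop rows that already contained NA for other reasons.

```r
library(data.table)

# Small illustrative table (assumed data, not from the question).
DT <- data.table(a = c(1, 0, 2, 3), b = c(5, 5, 5, 5), c = c(3, 1, 0, 4))

# Rows in which no column equals 0
rows_without_zero <- DT[rowSums(DT == 0) == 0]

# Columns in which no row equals 0
cols_without_zero <- DT[, colSums(DT == 0) == 0, with = FALSE]
```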
Solution
The rowSums function can be used here:
RRR[rowSums(!RRR)>0]
How it works: !RRR is a matrix with TRUE at any zero. In the general case, you can replace !RRR with whatever logical condition you want to check. For example, to see if any element is equal to 3, you could take the rowSums of RRR==3.
I think rowSums(test(RRR))>0 is essentially the same as apply(RRR,1,function(x)any(test(x))); both coerce the object to a matrix. I find the rowSums version easier to read and think I've heard people praise its efficiency.
For columns, similarly:
RRR[, colSums(!RRR)>0, with=FALSE]
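As a small worked illustration of the general pattern, the sketch below wraps the row and column forms in two helper functions. rows_any, cols_any and the table DT are hypothetical names introduced here, not part of data.table, and the example assumes the columns are numeric and contain no NA values.

```r
library(data.table)

# Hypothetical helpers: `test` is any function that returns a logical
# matrix when applied to the whole table (for example function(x) x == 0).
rows_any <- function(dt, test) {
  keep <- rowSums(test(dt)) > 0   # TRUE for rows with at least one match
  dt[keep]
}
cols_any <- function(dt, test) {
  keep <- colSums(test(dt)) > 0   # TRUE for columns with at least one match
  dt[, keep, with = FALSE]
}

# Small illustrative table (assumed data, not from the question).
DT <- data.table(a = c(1, 0, 2, 3), b = c(5, 5, 5, 5), c = c(3, 1, 0, 4))

zero_rows  <- rows_any(DT, function(x) x == 0)  # rows 2 and 3
zero_cols  <- cols_any(DT, function(x) x == 0)  # columns a and c
three_rows <- rows_any(DT, function(x) x == 3)  # rows 1 and 4
```

Computing the logical index before subsetting keeps the expression inside the brackets to a single variable, which avoids surprises if a column happens to share a name with one of the function arguments.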