Selecting rows or columns with data.table R?


Problem description



Imagine I have a data.table, for example:

library(data.table) 
RRR <-data.table(1:15,runif(15),rgeom(15,0.5),rbinom(15,2,0.5))

    V1         V2 V3 V4
 1:  1 0.33577273  0  0
 2:  2 0.66739739  2  1
 3:  3 0.07501655  0  0
 4:  4 0.43195663  2  1
 5:  5 0.39525841  3  2
 6:  6 0.15189738  1  1
 7:  7 0.02637279  0  1
 8:  8 0.44165623  0  1
 9:  9 0.98710570  2  0
10: 10 0.62402805  1  0
11: 11 0.84829465  3  2
12: 12 0.02170976  0  1
13: 13 0.74608925  0  2
14: 14 0.29102296  2  0
15: 15 0.83820646  1  1

How can I get from it a data.table with all the ROWS that contain a 0 (or some other value) in any column?
If I had to do it with a single column, I could use:

RRR[V4==0,]

   V1         V2 V3 V4
1:  1 0.33577273  0  0
2:  3 0.07501655  0  0
3:  9 0.98710570  2  0
4: 10 0.62402805  1  0
5: 14 0.29102296  2  0

But what if I want to do it with all the columns at once because I have many?

This doesn't do what I need.

RRR[,sapply(RRR,function(xx)(xx==0)), with=TRUE]   

     V1      V2     V3    V4
[1,]  FALSE FALSE  TRUE  TRUE
[2,]  FALSE FALSE FALSE FALSE
[3,]  FALSE FALSE  TRUE  TRUE
[4,]  FALSE FALSE FALSE FALSE
[5,]  FALSE FALSE FALSE FALSE
[6,]  FALSE FALSE FALSE FALSE
[7,]  FALSE FALSE  TRUE FALSE
[8,]  FALSE FALSE  TRUE FALSE
[9,]  FALSE FALSE FALSE  TRUE
[10,] FALSE FALSE FALSE  TRUE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE  TRUE FALSE
[13,] FALSE FALSE  TRUE FALSE
[14,] FALSE FALSE FALSE  TRUE
[15,] FALSE FALSE FALSE FALSE

Maybe with a for loop and some complicated paste? I would prefer to use simple data.table syntax, though.

Similarly, how would you get a data.table with all the COLUMNS that contain a '0' at any row?

I know how to get the columns that, as a whole, fulfill a condition, such as being numeric:

RRR[,sapply(RRR,function(xx)is.numeric(xx)),with=FALSE]

but this method doesn't work if I want to test the condition elementwise.


In case anybody is interested, these are the system.time() results on a bigger random data.table for the different solutions you have provided so far, with slight modifications.

set.seed(1)
n <- 1000000
RRR <- data.table(matrix(rgeom(100*n,0.5), ncol=100))

Getting ROWS   
> RRR[RRR[,rowSums(RRR==0)>0]] 
   user  system elapsed 
   2.72    0.55    3.27 
> RRR[rowSums(RRR==0)>0] 
   user  system elapsed 
   2.58    0.70    3.28 
> RRR[apply(RRR,MAR=1,function(xx)any(xx==0))]
   user  system elapsed 
   10.81    0.19   11.00       
> RRR[apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=1)]
  user  system elapsed 
  10.49    0.30   10.83 

Getting COLUMNS
> RRR[,sapply(RRR,function(xx)any(xx==0)), with=FALSE] 
   user  system elapsed 
   0.81    0.31    1.12 
> `[.listof`(RRR,colSums(RRR==0)>0) 
   user  system elapsed 
   2.14    0.27    2.41 
> RRR[,colSums(RRR==0)>0, with=FALSE] 
   user  system elapsed 
   2.26    0.48    2.75 
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(x==0))]      # only in version 1.9.5; seems to be the same solution as the first one
   user  system elapsed 
   0.78    0.36    1.14 
> RRR[, .SD, .SDcols=sapply(RRR, function(x) any(!as.logical(x)))]
   user  system elapsed 
   0.41    0.25    0.66 
> RRR[Reduce('|',lapply(RRR,function(xx)(xx==0)))]
   user  system elapsed 
   3.11    0.33    3.44 
> RRR[,apply(RRR[,paste0('V',1:ncol(RRR)),with=FALSE],function(xx)any(xx==0),MAR=2),with=FALSE]
   user  system elapsed 
   3.48    0.80    4.28  

I haven't included yet:

RRR[, i := any(unlist(lapply(.SD, function(x) x==0))), seq_len(nrow(RRR))][i==TRUE][,i:=NULL]   

It took several minutes, so I stopped it. It also "tags" the rows instead of extracting them, and it's the most complex solution.

I'll wait for faster or simpler solutions, and I'd like to hear your comments and preferences.

sapply was supposed to be slower, but it isn't. The results could change if the data.table contains other kinds of data.


We could speed it up if we could stop the test (==0) as soon as the first occurrence is found within each row or column. But I guess we can't do that without loops, some low-level access, or bitwise operations.
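One way to approximate this early stop for the row case, shown only as a sketch: loop over the columns and test only the rows that have not been flagged yet, leaving the loop once every row is flagged. The table `DT` and the vector `hit` are made-up names for illustration, and no speed claim is made here; whether it beats rowSums would have to be measured on real data.

```r
library(data.table)

# Small hand-made table (hypothetical data) so the result is easy to verify.
DT <- data.table(a = c(1, 0, 3, 4), b = c(3, 5, 6, 0), c = c(7, 8, 9, 10))

# hit[i] becomes TRUE once row i is known to contain a zero.
hit <- logical(nrow(DT))

for (j in seq_along(DT)) {
  todo <- which(!hit)            # rows still undecided
  if (length(todo) == 0L) break  # every row already has a zero: stop early
  hit[todo] <- DT[[j]][todo] == 0
}

rows_with_zero <- DT[hit]
print(rows_with_zero)
```

The loop runs once per column, not once per row, so the R-level overhead stays small even for tables with many rows.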


I've thought of a new method.
a) sapply(RRR,function(xx)which(xx==0))
b) I need to combine the results of a) by taking the union of the lists, but I don't know how to do it for any number of columns.
c) Then get those rows by indexing RRR with that result. I guess it's going to be much slower if the number of zeroes is big.

Maybe:

RRR[unique(unlist(sapply(RRR,function(xx)which(xx==0))))]

But it's too slow.
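For step b), a union over any number of columns can be written with Reduce. This is a minimal sketch on a made-up table `DT`. It uses lapply rather than sapply so the per-column results always stay a list, and it makes no claim about speed.

```r
library(data.table)

# Small hand-made table (hypothetical data).
DT <- data.table(a = c(1, 0, 3, 4), b = c(3, 5, 6, 0), c = c(7, 8, 9, 10))

# a) row positions of the zeros, one integer vector per column
zero_rows <- lapply(DT, function(x) which(x == 0))

# b) union across all columns, however many there are
idx <- sort(Reduce(union, zero_rows))

# c) extract those rows
rows_with_zero <- DT[idx]
print(rows_with_zero)
```

Sorting the index keeps the rows in their original order; without it they would come out in the order the zeros were found column by column.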

An option to get the opposite would be "RRR[(RRR==0)] <- NA; na.omit(RRR)"
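The opposite selection can also be sketched without overwriting any values, by keeping the rows whose count of zeros is exactly 0. `DT` is a made-up table for illustration.

```r
library(data.table)

# Small hand-made table (hypothetical data).
DT <- data.table(a = c(1, 0, 3, 4), b = c(3, 5, 6, 0), c = c(7, 8, 9, 10))

# Rows where no column equals 0.
rows_without_zero <- DT[rowSums(DT == 0) == 0]
print(rows_without_zero)
```

Unlike the NA approach, this leaves the original table untouched and does not drop rows that merely contain genuine NA values elsewhere.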

Solution

The rowSums function can be used here:

RRR[rowSums(!RRR)>0]

How it works: !RRR is a logical matrix with TRUE wherever RRR is zero. In the general case, you can replace !RRR with whatever logical condition you want to check. For example, to see if any element is equal to 3, you could take the rowSums of RRR==3.
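As a small illustration of the general case, the sketch below selects the rows containing the value 3. The table `DT` and the variable `target` are made up for the example.

```r
library(data.table)

# Small hand-made table (hypothetical data) so the result is easy to check by eye.
DT <- data.table(a = c(1, 0, 3, 4), b = c(3, 5, 6, 0), c = c(7, 8, 9, 10))

# Logical matrix: TRUE wherever a cell equals the target value.
target <- 3
hits <- DT == target

# Keep the rows that have at least one TRUE.
rows_with_target <- DT[rowSums(hits) > 0]
print(rows_with_target)
```

If the table may contain NA, the row sums become NA for affected rows; passing na.rm = TRUE to rowSums treats those cells as non-matches.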

I think rowSums(test(RRR))>0 is essentially the same as apply(RRR,1,function(x)any(test(x))); both coerce the object to a matrix. I find the rowSums version easier to read, and I think I've heard people praise its efficiency.
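The equivalence can be checked on a small example. Here `test` is a hypothetical predicate standing in for whatever condition is being checked, and `DT` is a made-up table.

```r
library(data.table)

# Small hand-made table (hypothetical data).
DT <- data.table(a = c(1, 0, 3, 4), b = c(3, 5, 6, 0), c = c(7, 8, 9, 10))

# Hypothetical elementwise predicate; swap in any vectorised condition.
test <- function(x) x == 0

# Version 1: count the TRUEs per row.
by_rowsums <- unname(rowSums(test(DT)) > 0)

# Version 2: ask per row whether any element passes the test.
by_apply <- unname(apply(DT, 1, function(x) any(test(x))))

print(identical(by_rowsums, by_apply))
```

Both versions go through a matrix, so on a table with mixed column types the values are coerced to a common type before the test is applied.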


For columns, similarly:

RRR[, colSums(!RRR)>0, with=FALSE]
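A minimal sketch of the column case on a made-up table `DT`. The logical result of colSums is turned into column names before subsetting, which is just one way to pass it to `with=FALSE`.

```r
library(data.table)

# Small hand-made table (hypothetical data): only columns a and b contain a zero.
DT <- data.table(a = c(1, 0, 3, 4), b = c(3, 5, 6, 0), c = c(7, 8, 9, 10))

# TRUE for every column that has at least one zero.
has_zero <- colSums(DT == 0) > 0

# Select those columns by name.
cols_with_zero <- DT[, names(DT)[has_zero], with = FALSE]
print(cols_with_zero)
```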
