基于选定列中任何值的高效子集data.table的方法 [英] Efficient way to subset data.table based on value in any of selected columns

查看：57 发布时间：2020/10/15 20:00:14 r data.table

本文介绍了基于选定列中任何值的高效子集data.table的方法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我有一个包含6列的data.table

Let's say I've got a data.table with 6 columns

library(data.table)
set.seed(123)
dt <- data.table( id = 1:100,
                  p1 = sample(1:10, 100, replace = TRUE ),
                  p2 = sample(1:10, 100, replace = TRUE ),
                  p3 = sample(1:10, 100, replace = TRUE ),
                  p4 = sample(1:10, 100, replace = TRUE ),
                  p5 = sample(1:10, 100, replace = TRUE ) )

现在，我想对该数据进行子集化。，位于p1-pn列（此处为p1-p5）上。我想保留所有在p列的任何包含 10 值的行。

Now, I want to subset this data.table, on the p1 - pn columns (here: p1-p5). I want to keep all rows where any of the p-columns contains the value of 10.

对于这个小的样本数据表，可以通过以下方式手动完成：

For this small sample data.table, this can be done manually with

test1 <- dt[ p1 == 10 | p2 == 10 | p3 == 10 | p4 == 10 | p5 == 10, ]

但是我的生产数据包含数十个p列，因此手动将它们全部键入痛苦...

But my production data contains dozens of p-columns, so typing them all out manually would be a pain...

我当前的解决方案是首先创建一个我需要的列名的向量：

My current solution is to first create a vector with the column-names i need:

cols <- grep( "^p", names( dt ), value = TRUE )

。 ..然后使用 apply 进行子设置：

...and then do the subsetting using apply:

test2 <- dt[ apply( dt[, ..cols ], 1, function(r) any( r == 10 ) ), ]

检查：

identical(test1, test2)
# TRUE

我的实际问题

上述解决方案（使用 apply ）对我来说足够快。。但是我不确定这是否是最佳解决方案。我是data.table的新手（与SO上的其他人相比），这（可能是？）不是实现我想要的子集的最有效/最有效/最优雅的方法。

The above solution (using apply) is fast enough for me.. But I'm not sure it is the optimal solution. I'm pretty new to data.table (compared to some others here on SO), and this is (probably?) not the most efficient/effective/elegant way to achieve the subset I want.

我在这里学习，所以有人对我提出的问题有更优雅/更好/更快的方法吗？

I'm here to learn, so has anyone got a more elegant/better/faster approach to my subsetting question?

该问题被标记为重复...但是我仍将答案发布在这里：

The question had been marked duplicate... But I'll still post my answers here:

我发现@Marcus的答案是最好的（=可读）代码，并且答案

I found the answer from @Marcus to be the best (=readable) code, and the answer from @akrun to be the fastest.

基准化

data.table，其中包含1,000,000行和50列感兴趣的数据（即p列）

data.table with 1,000,000 rows and 50 columns of interest (i.e. p-columns)

#create sample data
set.seed( 123 )
n   <- 1000000
k   <- 100
dat <- sample( 1:100, n * k, replace = TRUE )
DT  <- as.data.table( matrix( data = dat, nrow = n, ncol = k ) )
setnames( DT, names( DT ), c( paste0( "p", 1:50 ), paste( "r", 1:50 ) ) )

#vector with columns starting with "p"
cols <- grep( "^p", names( DT ), value = TRUE )

apply_method   <- DT[ apply( DT[, ..cols ], 1, function(x) any( x == 10 ) ), ]
reduce_method  <- DT[ DT[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]]
rowsums_method <- DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1 ]

identical(  apply_method, rowsums_method )

microbenchmark::microbenchmark(
  apply   = DT[ apply( DT[ , ..cols ], 1, function(x) any( x == 10 ) ), ],
  reduce  = DT[ DT[, Reduce( `|`, lapply( .SD, `==`, 10 ) ), .SDcols = cols ] ],
  rowSums = DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1, ],
  times = 10
)

#    expr       min        lq      mean    median        uq       max neval
#   apply 3352.0640 3441.7760 3665.5004 3662.7666 3760.7553 4325.9125    10
#  reduce  408.6349  437.6806  552.8850  572.2012  657.6072  710.7699    10
# rowSums  619.2594  663.7325  784.2389  850.0963  868.2096  892.7469    10

基于选定列中任何值的高效子集data.table的方法 [英] Efficient way to subset data.table based on value in any of selected columns

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

基于选定列中任何值的高效子集data.table的方法 [英] Efficient way to subset data.table based on value in any of selected columns

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭