Select columns of data.table based on regex
Question
How can I select columns of a data.table based on a regex? Consider a simple example:
library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))
Is there a way to select the columns bar and baz from the data.table based on a regex? I know that the following solution works, but if the table is much bigger and I want to choose more variables, it quickly gets cumbersome.
mydt[, .(bar, baz)]
I would like to have something like matches() in dplyr::select(), but only by reference.
Recommended answer
UPDATE: I updated the comparison to include @sindri_baldur's answer, using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters, one should stick with the .. or with = FALSE solution (see below).
Apparently, there is a new way of achieving this from version 1.10.2 onwards.
library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]
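As a quick sanity check on the small example table, the sketch below compares the .. prefix with the two alternatives mentioned in the update (with = FALSE and .SDcols = patterns()). It assumes a data.table version recent enough to accept patterns() in .SDcols, such as the 1.12.6 used for the comparison; the object names a, b and p are only illustrative.

```r
library(data.table)

mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# Names of the columns that match the regex
cols <- grep("bar|baz", names(mydt), value = TRUE)

a <- mydt[, ..cols]                              # .. prefix
b <- mydt[, cols, with = FALSE]                  # with = FALSE
p <- mydt[, .SD, .SDcols = patterns("bar|baz")]  # patterns() in .SDcols

# All three results should hold the same columns and values
stopifnot(isTRUE(all.equal(a, b)), isTRUE(all.equal(a, p)))
names(a)
```

Each of these calls returns a new data.table and leaves mydt unchanged.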
In the benchmark below, it is among the fastest of the posted solutions, roughly on par with with = FALSE.
library(microbenchmark)

# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]

# Methods
subsetting <- function(dt) {
  subset(dt, select = grep("bar|baz", names(dt)))
}
usingSD <- function(dt) {
  dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}
usingWith <- function(dt) {
  cols <- grep("bar|baz", names(dt), value = TRUE)
  dt[, cols, with = FALSE]
}
usingDotDot <- function(dt) {
  cols <- grep("bar|baz", names(dt), value = TRUE)
  dt[, ..cols]
}
usingPatterns <- function(dt) {
  dt[, .SD, .SDcols = patterns("bar|baz")]
}

# Benchmark (usingPatterns is included so the call matches the results below)
microbenchmark(
  subsetting(big_dt), usingSD(big_dt), usingWith(big_dt), usingDotDot(big_dt),
  usingPatterns(big_dt),
  times = 5000
)
# Unit: microseconds
#                   expr min   lq mean median   uq    max neval
#     subsetting(big_dt) 430  759 1672   1309 1563  82934  5000
#        usingSD(big_dt) 547  951 1872   1461 1797  60357  5000
#      usingWith(big_dt) 278  496 1331   1112 1304  62656  5000
#    usingDotDot(big_dt) 289  483 1392   1117 1344  55878  5000
#  usingPatterns(big_dt) 596 1019 1984   1518 1913 120331  5000
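The question asks for a selection "by reference", while every method benchmarked above returns a new data.table. One possible reading of that request is to keep only the matching columns in the existing table. The sketch below is a hedged illustration of that reading, not part of the original answer: it deletes the non-matching columns in place with := NULL. The name drop_cols is hypothetical.

```r
library(data.table)

mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# Columns that do NOT match the regex (invert = TRUE negates the match)
drop_cols <- grep("bar|baz", names(mydt), value = TRUE, invert = TRUE)

# Remove them by reference; mydt itself is modified, no copy is returned
if (length(drop_cols) > 0) {
  mydt[, (drop_cols) := NULL]
}
names(mydt)
```

This changes mydt permanently, so it only fits when the dropped columns are no longer needed.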