Select columns of data.table based on regex

Problem description

How can I select columns of a data.table based on a regex? Consider a simple example:

library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))

Is there a way to select the columns bar and baz from the data.table based on a regex? I know that the following solution works, but if the table is much bigger and I want to choose more variables, it quickly becomes cumbersome.

mydt[, .(bar, baz)]

I would like to have something like matches() in dplyr::select(), but only by reference.

Recommended answer

UPDATE: I updated the comparison to include @sindri_baldur's answer, using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters, one should stick with the .. or with = FALSE solution (see below).

Apparently, there is a new way of achieving this from version 1.10.2 onwards.

library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]

It seems to be among the fastest of the posted solutions, roughly on par with with = FALSE, as the benchmark below suggests.
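As a quick sanity check before timing anything, the sketch below confirms on the small table from the question that the .. prefix and with = FALSE pick the same columns. The anchored pattern "^ba[rz]$" and the variable names by_dotdot and by_with are my own choices for illustration, not part of the original answer; anchoring is just one way to avoid matching a name such as "foobar" by accident.

```r
library(data.table)

# Small example table from the question
mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# Anchored pattern (an assumption for illustration): match exactly "bar" or "baz"
cols <- grep("^ba[rz]$", names(mydt), value = TRUE)

# ".." prefix: treat cols as a variable holding column names
by_dotdot <- mydt[, ..cols]

# Older equivalent: with = FALSE makes j a vector of column names
by_with <- mydt[, cols, with = FALSE]

# Both forms should give the same two-column data.table
stopifnot(identical(names(by_dotdot), c("bar", "baz")))
stopifnot(isTRUE(all.equal(by_dotdot, by_with)))
```

Both calls take a character vector of names, so any regex logic stays in the grep() step and can be swapped out freely.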

# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]

# Methods
subsetting <- function(dt) {
    subset(dt, select = grep("bar|baz", names(dt)))
}

usingSD <- function(dt) {
    dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}

usingWith <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, cols, with = FALSE]
}

usingDotDot <- function(dt) {
    cols <- grep("bar|baz", names(dt), value = TRUE)
    dt[, ..cols]
}

usingPatterns <- function(dt) {
    dt[, .SD, .SDcols = patterns("bar|baz")]
}

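# Benchmark (usingPatterns is included so the call matches the results below)
library(microbenchmark)
microbenchmark(
    subsetting(big_dt), usingSD(big_dt), usingWith(big_dt), usingDotDot(big_dt),
    usingPatterns(big_dt),
    times = 5000
)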

#Unit: microseconds
#                  expr  min   lq  mean median    uq    max neval
#    subsetting(big_dt)  430  759  1672   1309  1563  82934  5000
#       usingSD(big_dt)  547  951  1872   1461  1797  60357  5000
#     usingWith(big_dt)  278  496  1331   1112  1304  62656  5000
#   usingDotDot(big_dt)  289  483  1392   1117  1344  55878  5000
# usingPatterns(big_dt)  596 1019  1984   1518  1913 120331  5000
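The question also asks for a selection "by reference". As far as I know, all of the forms benchmarked above return a new data.table instead of modifying the existing one. One possible workaround, offered only as a sketch, is to delete the non-matching columns in place with := NULL. The names kept and drop_cols are hypothetical, and copy() is used here only so that the example leaves mydt untouched.

```r
library(data.table)

# Small example table from the question
mydt <- data.table(foo = c(1, 2), bar = c(2, 3), baz = c(3, 4))

# Work on a copy so the original table stays intact (illustration only)
kept <- copy(mydt)

# Names of the columns that do NOT match the pattern
drop_cols <- grep("bar|baz", names(kept), value = TRUE, invert = TRUE)

# := NULL removes the listed columns in place; skip if nothing needs dropping
if (length(drop_cols) > 0) {
    kept[, (drop_cols) := NULL]
}

names(kept)
```

This is destructive, so it only fits cases where the dropped columns are no longer needed; otherwise one of the subsetting forms above is the safer choice.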
