Fastest way to filter a data.frame list column contents in R / Rcpp
Question
I have a data.frame:
df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b",
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")
with a list column (each element is a character vector):
> str(df)
'data.frame': 3 obs. of 2 variables:
$ id : int 1 2 3
$ vars:List of 3
..$ : chr "a"
..$ : chr "a" "b" "c"
..$ : chr "b" "c"
I want to filter the list column according to setdiff(vars, remove_this):
library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))
which gets me this:
> res
id vars
1 1
2 2 b, c
3 3 b, c
But to drop the character(0) vars I have to do something like:
res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...
Actual dataset:
- 560K rows and 3800K rows, plus 10 more columns (to carry along).
- Is there a dplyr / data.table / other faster method?
- How to do this with Rcpp?
- Can the column be modified in place instead of by copying the lapply(vars,setdiff(... result?
(This is quite slow, which leads to the question...)
What's the most efficient way to filter out vars == character(0) if it must be a separate step?
Recommended answer
Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:
library(data.table)
dt = as.data.table(df) # or use setDT to convert in place
dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
# id vars newcol
#1: 2 a,b,c b,c
#2: 3 b,c b,c
You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or, if you don't need that info, you can simply overwrite the initial column, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
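As a minimal sketch of the overwrite-then-filter variant, assuming the toy data from the question and that base R's lengths() is available (R 3.2.0 or later), the emptiness check can be written without sapply. The name res is only illustrative:

```r
library(data.table)

# Toy data with the same shape as in the question
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))

dt <- as.data.table(df)

# Overwrite the list column by reference (no copy of the other columns)
dt[, vars := lapply(vars, setdiff, "a")]

# Keep only rows whose list element is non-empty, i.e. drop character(0)
res <- dt[lengths(vars) > 0L]
```

lengths() returns one integer per list element, so it should do the same job as sapply(vars, length) with less per-element overhead.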
Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
# id vars
#1: 2 b,c
#2: 3 b,c
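To make the chained one-liner easier to follow, here is the same idea split into three steps. This is only a sketch on the toy data from the question; the intermediate names long, kept and res are illustrative and not part of the original answer:

```r
library(data.table)

# Toy data with the same shape as in the question
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)

# Step 1: flatten the list column to long format, one row per (id, value);
# the unnamed result column is called V1 by data.table
long <- dt[, unlist(vars), by = id]

# Step 2: drop the unwanted values with a single vectorised comparison
kept <- long[!V1 %in% "a"]

# Step 3: rebuild the list column; ids with no remaining values have no rows
# left in 'kept', so they drop out without an explicit character(0) filter
res <- kept[, .(vars = list(V1)), by = id]
```

The filtering happens once over the whole flattened vector rather than once per row, which is where the expected speed-up comes from.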
To carry along the other columns, I think it's easiest to simply merge back:
dt[, othercol := 5:7]
# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
# id vars i.vars othercol
#1: 2 b,c a,b,c 6
#2: 3 b,c b,c 7
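If the duplicated i.vars column is not wanted, one possible variation is to join only the carried columns back, using an explicit on = instead of keys. This is a sketch, not part of the original answer; it assumes a data.table version that supports on = (1.9.6 or later), and the names filtered, carry and out are illustrative:

```r
library(data.table)

# Toy data with the same shape as in the question, plus one carried column
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))
dt <- as.data.table(df)
dt[, othercol := 5:7]

# Filtered list column, built as in the answer
filtered <- dt[, unlist(vars), by = id][!V1 %in% "a", .(vars = list(V1)), by = id]

# Columns to carry along, without the original list column
carry <- dt[, .(id, othercol)]

# For each surviving id in 'filtered', look up its carried columns
out <- carry[filtered, on = "id"]
```

Because every id in filtered also exists in carry, no nomatch argument is needed in this direction of the join.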