Fastest way to filter a data.frame list column contents in R / Rcpp

Views: 16 | Published: 2021/12/23 12:51:25 | Tags: r, performance, data.table, dplyr, rcpp


Question

I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", 
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")

with a list column (each element is a character vector):

> str(df)
'data.frame':   3 obs. of  2 variables:
     $ id  : int  1 2 3
     $ vars:List of 3
      ..$ : chr "a"
      ..$ : chr  "a" "b" "c"
      ..$ : chr  "b" "c"

I'd like to filter each element of the list column according to setdiff(vars, remove_this):

library(dplyr)
library(tidyr)
# inside mutate(), refer to the column directly rather than through df$vars
res <- df %>% mutate(vars = lapply(vars, setdiff, "a"))

which gets me this:

   > res
      id vars
    1  1     
    2  2 b, c
    3  3 b, c

But to drop the rows whose vars is character(0), I have to do something like:

res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...

Actual data set:

560K rows and 3800K rows, which also have 10 more columns (to carry along).

(This is quite slow, which leads to the questions...)

Is there a faster dplyr / data.table / other method?
How can this be done with Rcpp?

Can the column be modified in place, rather than by copying the lapply(vars, setdiff(... result?

What's the most efficient way to filter out the rows where vars is character(0), if that must be a separate step?

Recommended answer

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster, because := adds the column by reference and you won't have to copy the entire table just to add a column:

library(data.table)
dt = as.data.table(df)  # or use setDT to convert in place

dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#   id  vars newcol
#1:  2 a,b,c    b,c
#2:  3   b,c    b,c

You can also delete the original column (at essentially zero cost) by appending [, vars := NULL] at the end. Or, if you don't need the original values, you can simply overwrite the initial column, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
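As a minimal sketch of the overwrite variant, assuming the same toy data as in the question, the column can be replaced by reference and the empty rows dropped in a second step. Using lengths() for the filter is a substitution of mine for sapply(newcol, length); it is a base R function that returns the length of each list element.

```r
library(data.table)

# Toy data from the question, built directly as a data.table
dt <- data.table(
  id   = 1:3,
  vars = list("a", c("a", "b", "c"), c("b", "c"))
)

# Overwrite the list column by reference (no copy of the other columns)
dt[, vars := lapply(vars, setdiff, "a")]

# Separate filtering step: keep rows whose element is not character(0)
res <- dt[lengths(vars) > 0L]
```

Here res keeps only the rows with id 2 and 3, while dt itself still holds all three rows with the updated vars column.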

Now, as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#   id vars
#1:  2  b,c
#2:  3  b,c
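To illustrate why the long-format approach filters automatically, here is a rough base R sketch of the same idea (not the answer's code): flatten the list once, drop the unwanted values with a single vectorised comparison, and regroup by id. The names long_id, long_var and keep are hypothetical helpers introduced only for this example.

```r
# Toy data from the question
id   <- 1:3
vars <- list("a", c("a", "b", "c"), c("b", "c"))

# Flatten: one row per (id, value) pair
long_id  <- rep(id, lengths(vars))
long_var <- unlist(vars)

# One vectorised filter instead of one setdiff() call per row
keep <- !long_var %in% "a"

# Regroup; ids with nothing left never appear, so no extra filtering is needed
res <- split(long_var[keep], long_id[keep])
```

An id whose values were all removed simply has no rows left in the long format, which is why it disappears from the regrouped result without a separate character(0) check.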

To carry along the other columns, I think it's easiest to simply merge back:

dt[, othercol := 5:7]

# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#   id vars i.vars othercol
#1:  2  b,c  a,b,c        6
#2:  3  b,c    b,c        7
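If the duplicated i.vars column in that result is unwanted, one possible variation (a sketch under the same toy data, not part of the original answer) is to join the filtered result onto only the carried columns, with an explicit on = "id". The names filtered and v are hypothetical and used only for illustration.

```r
library(data.table)

# Toy data from the question plus one carried column
dt <- data.table(
  id       = 1:3,
  vars     = list("a", c("a", "b", "c"), c("b", "c")),
  othercol = 5:7
)

# Long format, filter, regroup (same idea as the answer's one-liner)
filtered <- dt[, .(v = unlist(vars)), by = id][
  !v %in% "a", .(vars = list(v)), by = id]

# Join back only the carried columns, so the old vars column is not duplicated
res <- dt[, .(id, othercol)][filtered, on = "id"]
```

The result has one row per surviving id, with the columns id, othercol and the filtered vars.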
