Fastest way to filter a data.frame list column contents in R / Rcpp


Problem description


I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", 
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")

with a list column (each with a character vector):

> str(df)
'data.frame':   3 obs. of  2 variables:
     $ id  : int  1 2 3
     $ vars:List of 3
      ..$ : chr "a"
      ..$ : chr  "a" "b" "c"
      ..$ : chr  "b" "c"

I want to filter the data.frame according to setdiff(vars, remove_this), here with remove_this = "a":

library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))

which gets me this:

   > res
      id vars
    1  1     
    2  2 b, c
    3  3 b, c

But to drop the rows whose vars is now character(0), I have to do something like:

res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...

Actual datasets:

  • Two datasets, of 560K rows and 3800K rows, that also have 10 more columns (to carry along).

(This is quite slow, which leads to the question...)

What is the fastest way to do this in R?

  • Is there a dplyr/ data.table/ other faster method?
  • How to do this with Rcpp?

UPDATE/EXTENSION:

  • Can the column modification be done in place rather than by copying the lapply(vars,setdiff(... result?

  • What's the most efficient way to filter out rows where vars is character(0), if it must be a separate step?

Solution

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:

library(data.table)
dt = as.data.table(df)  # or use setDT to convert in place

dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#   id  vars newcol
#1:  2 a,b,c    b,c
#2:  3   b,c    b,c

You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
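As a minimal sketch of the overwrite-then-filter variant just described: the list column is replaced by reference and the empty rows are dropped in a separate step. Using base R's lengths() for the filter is my substitution for sapply(newcol, length), not something the original snippet does, and the toy data is the one from the question.

```r
library(data.table)

# Toy data from the question: an id column plus a list column of character vectors
df <- structure(list(id = 1:3,
                     vars = list("a", c("a", "b", "c"), c("b", "c"))),
                row.names = c(NA, -3L), class = "data.frame")
dt <- as.data.table(df)

# Overwrite the list column by reference; the other columns are not copied
dt[, vars := lapply(vars, setdiff, "a")]

# lengths() gives the length of every list element, so this keeps
# only the rows whose vars is not character(0)
res <- dt[lengths(vars) > 0]
res
```

The filter still produces a new, smaller table; only the column update itself happens in place.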


Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#   id vars
#1:  2  b,c
#2:  3  b,c
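To see why the filtering comes for free, it may help to run the chain one step at a time. This is only a sketch of the same idea split into intermediate tables; the names long, filtered and res are mine.

```r
library(data.table)

# Toy data from the question
df <- structure(list(id = 1:3,
                     vars = list("a", c("a", "b", "c"), c("b", "c"))),
                row.names = c(NA, -3L), class = "data.frame")
dt <- as.data.table(df)

# Step 1: flatten to long format, one row per (id, element) pair.
# The unnamed result column gets data.table's default name V1.
long <- dt[, unlist(vars), by = id]

# Step 2: drop the unwanted values with an ordinary vectorised filter
filtered <- long[!V1 %in% "a"]

# Step 3: re-nest the surviving values into one list element per id.
# An id whose values were all removed has no rows left in `filtered`,
# so it never appears in the result.
res <- filtered[, .(vars = list(V1)), by = id]
res
```

This relies on id identifying each row uniquely, as stated above; with duplicated ids the re-nesting step would merge rows.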

To carry along the other columns, I think it's easiest to simply merge back:

dt[, othercol := 5:7]

# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#   id vars i.vars othercol
#1:  2  b,c  a,b,c        6
#2:  3  b,c    b,c        7
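If the leftover i.vars column is not wanted, one possible variation (a sketch, not part of the original answer) is to join the filtered table against only the columns to carry along, using an explicit on = instead of a key. The names kept and others are mine, and othercol is the example column from above.

```r
library(data.table)

# Toy data from the question, plus the extra column used above
df <- structure(list(id = 1:3,
                     vars = list("a", c("a", "b", "c"), c("b", "c"))),
                row.names = c(NA, -3L), class = "data.frame")
dt <- as.data.table(df)
dt[, othercol := 5:7]

# Filtered list column, one row per surviving id
kept <- dt[, unlist(vars), by = id][!V1 %in% "a", .(vars = list(V1)), by = id]

# Columns to carry along, without the original vars
others <- dt[, .(id, othercol)]

# Inner join on id: nomatch = 0 drops ids that have no row in `kept`
res <- kept[others, on = "id", nomatch = 0]
res
```

With ten or so extra columns, others would list all of them; whether this is faster than the keyed join above on the real data would need to be measured.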
