R: extract inner higher level combinations (groups of 1, 2, 3, and 4 elements) out of a data frame of combinations of 5 elements


Question

Sorry I have to post another question following up on this one and this other one.

While the answer to the second one addresses the MWE perfectly, in my real world data I need to do things differently, and wondered if someone could help.

So this time around, my starting point is a data frame (named plusminus_df) of combinations of 5 elements (in reality it can be 1 to n), of the following form:

> markers=LETTERS[1:5]
> plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+","-")), stringsAsFactors=FALSE)
> names(plusminus_df)=LETTERS[1:5]
> head(plusminus_df)
  A B C D E
1 + + + + +
2 - + + + +
3 + - + + +
4 - - + + +
5 + + - + +
6 - + - + +

So it is just a dataframe of combinations of +/- for all the 5 markers (note this is a variable number). What I would need to do at this point, is to extract the inner higher level combinations of 1, 2, 3, and 4 markers (note these are variable numbers), preserving this same dataframe structure (in that sense, I would need to include NAs).

So my expected result would be something like this:

> final_df
      A    B    C    D    E
1     + <NA> <NA> <NA> <NA>
2     - <NA> <NA> <NA> <NA>
3     +    - <NA> <NA> <NA>
4     -    - <NA> <NA> <NA>
5     +    + <NA> <NA> <NA>
6     -    + <NA> <NA> <NA>
7     +    -    - <NA> <NA>
8     -    -    - <NA> <NA>
9     +    +    + <NA> <NA>
10    -    +    + <NA> <NA>
11    +    -    + <NA> <NA>
12    -    -    + <NA> <NA>
13    +    +    - <NA> <NA>
14    -    +    - <NA> <NA>
15    +    -    -    - <NA>
16    -    -    -    - <NA>
17    +    +    +    + <NA>
...
n     +    +    +    +    +
n+1   -    +    +    +    +
n+2   +    -    +    +    +
n+3   -    -    +    +    +
n+4   +    +    -    +    +
n+5   -    +    -    +    +
...

With all the possible combinations of 1 marker (+ and -), 2 markers, 3, 4, and 5 (as in the original), filling in the non-used markers with NA.

So the answer to the second question works well to build this desired final dataframe from scratch, just from the original markers vector. But in my real world case I can actually retrieve a filtered down list of 5 marker combinations in the form of the plusminus_df above... What would be the most straightforward and efficient way to obtain the desired dataframe from this one, without having to deal with messy nested loops?

Answer

I'm not completely certain I've understood what you're looking for, but from the second question it looks like you are looking for all cross-combinations of columns within a data.frame.

Disclaimer: The two answers already provided are more readable, whereas I focus on speed.

As you are performing what is often known as a cross-join (or outer-full-join) one aspect that quickly becomes a concern as n increases is efficiency. For efficiency it helps to split the problem into smaller sub-problems, and find a solution for each problem. As we need to find all unique combinations within the set of columns including the null set (value = NA), we can reduce this problem into 2 sub-problems.

  1. Find unique values for each column including the null set
  2. Expand this set to include all combinations of each column.

Using this idea we can quickly put together a simple solution using expand.grid, unique and lapply. The only tricky part is including the null set, which we can do by adding an NA to the row index when subsetting the data.frame: this returns all rows plus one extra row filled with NA.

# Create null-set-included data.frame
nullset_df <- plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ]
# Find all unique elements, including null set
unique_df <- lapply(nullset_df, unique)
# Combine all unique sets
expand.grid(unique_df)
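
To see what the NA index does, here is a minimal sketch on a toy two-column data frame (small_df is a hypothetical input, not part of the question). Each column contributes three values (NA, "+", "-"), so the grid has 3^2 = 9 rows, one of which is entirely NA. Whether that all-NA row should be kept is an assumption about the desired output; the last line shows one way to drop it.

```r
# Hypothetical toy input: two markers, each observed as "+" and "-"
small_df <- data.frame(A = c("+", "-"), B = c("+", "-"),
                       stringsAsFactors = FALSE)

# An NA row index yields one extra row filled with NA
with_null <- small_df[c(NA, seq_len(nrow(small_df))), ]

# Unique values per column, now including NA
values <- lapply(with_null, unique)

# All combinations across columns: 3 values per column, 2 columns
result <- expand.grid(values, stringsAsFactors = FALSE)

# Optionally drop the single row in which every marker is NA
result_no_empty <- result[rowSums(!is.na(result)) > 0, ]
```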

or as a function

nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
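
As a usage sketch, the function can be applied to the plusminus_df from the question. The variable final_df and the per-row count are only illustrative. Extra arguments such as stringsAsFactors are forwarded to expand.grid through the dots.

```r
nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)

# Rebuild the input from the question
markers <- LETTERS[1:5]
plusminus_df <- expand.grid(lapply(seq_along(markers), function(x) c("+", "-")),
                            stringsAsFactors = FALSE)
names(plusminus_df) <- markers

# Keep the columns as character vectors rather than factors
final_df <- nullgrid.expand(plusminus_df, stringsAsFactors = FALSE)

dim(final_df)  # 3^5 = 243 rows, 5 columns

# Number of rows per count of non-NA markers (0 to 5)
table(rowSums(!is.na(final_df)))
```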

This is fairly fast (benchmarks and performance graphs later), but I wanted to go one step further. The data.table package is known for its high-performance functions, and one of them is the CJ function for performing cross-joins. Below is one implementation using CJ:

library(data.table)
nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))
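
A quick sanity check, sketched under the assumption that data.table is installed: both versions should return the same set of rows, although the row order may differ between CJ and expand.grid. The helper row_keys is hypothetical and only used for the comparison.

```r
library(data.table)

nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))

# Same input as in the question
plusminus_df <- expand.grid(rep(list(c("+", "-")), 5), stringsAsFactors = FALSE)
names(plusminus_df) <- LETTERS[1:5]

dt_res <- nullgrid.expand.dt(plusminus_df)
grid_res <- expand.grid(
  lapply(plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ], unique),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse each row into one string and sort,
# so the comparison ignores row order
row_keys <- function(x) sort(do.call(paste, c(unname(as.list(x)), sep = "|")))

same_rows <- identical(row_keys(dt_res), row_keys(grid_res))
```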

CJ takes its vectors as separate arguments, which forces us to use do.call (or similar) and makes the function slightly less readable. But is there any performance gain? To test this, I ran a microbenchmark on the two functions and the ones provided by the existing answers (code below). The result is visualized in a violin plot below:

From the plot it is quite clear that @paul's answer outperforms @ekoam's answer, and that the two functions above both outperform the provided answers in terms of speed. But the question said that the input might have any number of dimensions, so there is also the question of how well our functions scale with the number of columns and the number of unique values (here we only have "+" and "-", but what if we had more?). For this I reran the benchmark for n_columns = 3, 4, ..., 10 and n_values = 2, 4, ..., 10. The two results are visualized with smooth curves below.
First we'll visualize the time as a function of the number of columns. Note that the y axis is on a logarithmic scale (base 10) for easier comparison.

From the visualization it is quite clear that, with increasing number of columns, the choice of method becomes very important. The suggestion by @ekoam becomes very slow, primarily because it delays a call to unique till the very end. The remaining 3 methods are all much faster, while nullgrid.expand.dt becomes more than 10 times faster in comparison to the remaining methods once we get more than 8 columns of data.
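
The cost of delaying unique can be made concrete by counting rows. The sketch below is a base R paraphrase of the per-row approach (lapply and rbind stand in for apply and bind_rows, so it is not @ekoam's exact code). For the 32-row input from the question, it builds 992 intermediate rows, of which only 242 are distinct.

```r
# Same input as in the question
plusminus_df <- expand.grid(rep(list(c("+", "-")), 5), stringsAsFactors = FALSE)
names(plusminus_df) <- LETTERS[1:5]

# For every input row, expand each value against NA and drop the all-NA row
per_row <- lapply(seq_len(nrow(plusminus_df)), function(i) {
  r <- as.list(plusminus_df[i, ])
  grid <- expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE)
  head(grid, -1L)  # the last row of the grid is the one with only NA
})

stacked <- do.call(rbind, per_row)

n_before <- nrow(stacked)          # 32 input rows * 31 = 992
n_after  <- nrow(unique(stacked))  # 3^5 - 1 = 242
```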

Next, let's look at the timing compared to the number of unique values in each column (number of columns fixed at 5).

Again we see a similar picture. Apart from a single possible outlier for nullgrid.expand, which seems to become slower than paul's answer as the number of unique values increases, nullgrid.expand.dt remains the fastest. Here, though, the saving compared to paul's answer is smaller: by the time we reach 10 unique values it is only about (exp(4) - exp(3.6)) / exp(3.6) ~ 50 %, that is, paul's answer takes roughly 1.5 times as long. (If 4 and 3.6 are read off the base-10 axis produced by the plotting code, the ratio is instead 10^0.4, or about 2.5.)
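
The arithmetic behind that estimate can be checked directly. The sketch below takes the two log values quoted above as given and computes the ratio under both a natural-log and a base-10 reading; which reading applies depends on how the values were taken from the plot, which is an assumption here.

```r
# Log-scale readings quoted in the text
log_paul <- 4.0
log_dt   <- 3.6

# Natural-log reading, as in the expression above
rel_e   <- (exp(log_paul) - exp(log_dt)) / exp(log_dt)  # relative extra time
ratio_e <- exp(log_paul - log_dt)                       # paul / nullgrid.expand.dt

# Base-10 reading, matching log(..., base = 10) in the plotting code
ratio_10 <- 10^(log_paul - log_dt)

round(c(rel_e = rel_e, ratio_e = ratio_e, ratio_10 = ratio_10), 2)
# rel_e 0.49, ratio_e 1.49, ratio_10 2.51
```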

Please note that I did not have enough RAM to run the benchmark for number of unique values or columns greater than the ones shown.
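
As a rough guide to how quickly the objects grow, the row counts can be computed up front. Both helpers below are hypothetical. grid_rows gives the size of the final cross-join for n columns with j unique values each, plus NA. perrow_rows gives the number of intermediate rows the per-row approach builds when the input holds all j^n combinations, as it does in the benchmark. This only illustrates the growth; it does not establish which step exhausted memory.

```r
# Rows in the final result: each column has j values plus NA
grid_rows <- function(n, j) (j + 1)^n

# Intermediate rows of the per-row approach before unique():
# j^n input rows, each expanded to 2^n - 1 rows
perrow_rows <- function(n, j) j^n * (2^n - 1)

grid_rows(5, 2)     # 243
grid_rows(10, 2)    # 59049
perrow_rows(5, 2)   # 992
perrow_rows(10, 2)  # 1047552

# Growth over the benchmarked column counts with 2 unique values
sapply(3:10, grid_rows, j = 2)
```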

Conclusion

We've seen that there are many ways to reach the answer sought by the question, but as the number of columns and unique values increases, the choice of method becomes more and more important. By utilizing optimized libraries, we can drastically reduce the time required to get the cross-join of all column values with only minimal effort. With extra effort using Rcpp we could likely reduce the running time even further, but that is outside the scope of my answer.

Benchmark code

# Setup:
set.seed(1234)
library(tidyverse)
library(data.table)
library(microbenchmark)  # provides microbenchmark() and its autoplot() method
nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))
markers=LETTERS[1:5]
plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+","-")), stringsAsFactors=FALSE)
names(plusminus_df)=LETTERS[1:5]

bm <- microbenchmark(
  nullgrid.expand = nullgrid.expand(plusminus_df),
  nullgrid.expand.dt = nullgrid.expand.dt(plusminus_df),
  ekoam = unique(bind_rows(apply(
    plusminus_df, 1L, 
    function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
  ))),
  paul = {
    plusminus_df %>%
      add_row() %>%
      map(unique) %>%
      expand.grid()
  }, 
  control = list(warmup = 5)
)
library(ggplot2)
autoplot(bm) + ggtitle('comparison between cross-join')

Timing function

time_function <- function(n = 5, j = 2){
  idx <- seq_len(n)
  df <- do.call(CJ, args = c(lapply(idx, function(x) as.character(seq_len(j))),
                             sorted = FALSE,
                             unique = TRUE))
  names(df) <- as.character(idx)
  microbenchmark(
    nullgrid.expand = nullgrid.expand(df),
    nullgrid.expand.dt = nullgrid.expand.dt(df),
    ekoam = unique(bind_rows(apply(
      df, 1L, 
      function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
    ))),
    paul = {
      df %>%
        add_row() %>%
        map(unique) %>%
        expand.grid()
    }, 
    times = 10,
    control = list(warmup = 5)
  )
}
res <- lapply(seq(3, 10), time_function)
for(i in seq_along(res)){
  res[[i]]$n <- seq(3, 10)[i]
}
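# microbenchmark records `time` in nanoseconds, so time / 10^4 is in units of 10 microseconds
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) + 
  geom_smooth(se = FALSE) + 
  ggtitle('time-comparison given number of columns') + 
  labs(y = 'log10(time / 10^4 ns)', x = 'n')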
ggsave('so_2.png')

res <- lapply(c(seq(2, 10, 2)), time_function, n = 5)
for(i in seq_along(res)){
  res[[i]]$n <- seq(2, 10, 2)[i]
}
# microbenchmark records `time` in nanoseconds, so time / 10^4 is in units of 10 microseconds
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) + 
  geom_smooth(se = FALSE) + 
  ggtitle('time-comparison given number of unique values') + 
  labs(y = 'log10(time / 10^4 ns)', x = 'n unique values per column')
ggsave('so_3.png')
