split and expand.grid by group on large data set

Question

I have a df in the following format and am trying to get a data frame with all pairwise combinations per group:

df <- structure(list(id = c(209044052, 209044061, 209044061, 209044061,
                            209044062, 209044062, 209044062, 209044182,
                            209044183, 209044295),
                     group = c(2365686, 387969, 388978, 2365686, 387969,
                               388978, 2365686, 2278460, 2278460, 654238)),
                .Names = c("id", "group"), row.names = c(NA, -10L),
                class = "data.frame")

While do.call(rbind, lapply(split(df, df$group), function(i) expand.grid(i$id, i$id))) works for a small data frame, I run into time problems on my large data (~12 million obs. and ~1.5 million groups).

After some testing I recognized that the split command seems to be the bottleneck, and expand.grid might also not be the fastest solution.

I found some improvements for expand.grid (Use outer instead of expand.grid) and some faster split alternatives (Improving performance of split() function in R?), but struggle to put it all together with grouping.

The output should look like:

  Var1      Var2
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044295 209044295
209044182 209044182
209044183 209044182
....

As an extra I would like to exclude repetitions of the same pair and self-references (e.g. 209044061 209044061 above), and to keep only one combination when the same pair appears in both orders (e.g. 209044061 209044062 and 209044062 209044061 above), i.e. combinations without repetition. I tried combinations() from library(gtools) but could not figure out whether this slows down the calculation even more.
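What "combinations without repetition" means here can be sketched for a single group with base R's combn() (the ids below are the unique ids of group 2365686 from the example df; combn() assumes at least 2 unique ids in the group):

```r
ids <- c(209044052, 209044061, 209044062)  # unique ids of one group
t(combn(ids, 2))  # one row per unordered pair; no self-pairs, no reversed duplicates
```

For 3 ids this yields choose(3, 2) = 3 rows, each pair in ascending order.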

Answer

One possible solution which avoids repetitions of the same pair as well as different orders is using the data.table and combinat packages:

library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]

     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

order(id) is used here just for convenience, to make the results easier to check; it can be skipped in production code.

There is another approach, in which the call to combn2() is replaced by a non-equi join:

mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
    allow.cartesian = TRUE]

     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

Note that the non-equi join requires the data to be ordered.

The second method seems to be much faster:

# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
  id = sample.int(rg, nr, TRUE),
  group = sample.int(ng, nr, TRUE)
)

# benchmark code
microbenchmark::microbenchmark(
  combn2 = df2[order(group, id), data.table(combinat::combn2(unique(id))), by = group],
  nej = {
    mdf <- df2[order(group, id), unique(id), by = group]
    mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
        allow.cartesian = TRUE]},
  times = 1L)

For 120000 rows and 14994 groups the timings are:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1

Caveat

As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations is O(n^2), exactly n * (n - 1) / 2 or choose(n, 2L), where n is the number of ids.
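The quadratic growth can be checked directly, e.g. for n = 8 ids in a group (the rg value used in the benchmark data above):

```r
n <- 8L
choose(n, 2L)     # 28 pairs
n * (n - 1L) / 2  # same value, written out
```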

The size of the largest group can be found by:

df2[, uniqueN(id), by = group][, max(V1)]

The total number of rows in the final result can be computed in advance by:

df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]
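Applied to the original problem size, this gives a rough feasibility estimate. The numbers below are hypothetical: they assume the ~12 million rows are spread evenly over the ~1.5 million groups and that ids are unique within each group:

```r
n_rows <- 12e6
n_groups <- 1.5e6
ids_per_group <- n_rows / n_groups    # ~8 ids per group on average
n_groups * choose(ids_per_group, 2L)  # ~42 million result rows
```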
