R data.table: what is the fastest way to intersect a data.table by multiple columns by keys and groups


Problem description

MAJOR EDIT to clarify whether the answer is correct

I have a data.table with group columns (split_by), key columns (key_by) and trait id columns (intersect_by).

In each group of split_by, I want to keep only the rows where the trait ids are shared by all the keys present in that group.

For example:

dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2), group_id1= 1, group_id2= c(1:2, 2:1, 1:2), trait_id1 = 1, trait_id2 = 2:1)
setkey(dt, group_id1, group_id2, trait_id1, trait_id2)
dt
   id key1 key2 group_id1 group_id2 trait_id1 trait_id2
1:  4    1    1         1         1         1         1
2:  1    1    1         1         1         1         2
3:  5    1    2         1         1         1         2
4:  2    1    2         1         2         1         1
5:  6    1    2         1         2         1         1
6:  3    1    2         1         2         1         2

res <- intersect_this_by(dt,
                         key_by = c("key1"),
                         split_by = c("group_id1", "group_id2"),
                         intersect_by = c("trait_id1", "trait_id2"))

I would like res to be:

> res[]
   id key1 key2 group_id1 group_id2 trait_id1 trait_id2
1:  1    1    1         1         1         1         2
2:  5    1    2         1         1         1         2
3:  2    1    2         1         2         1         1
4:  6    1    2         1         2         1         1
5:  3    1    2         1         2         1         2

We see that id 4 has been dropped: in the group with group_id1 = 1 and group_id2 = 1 (the group id 4 belongs to), only one key combination, (1, 1), has the traits (1, 1), whereas there are two key combinations in this group: (1, 1) and (1, 2). The traits (1, 1) are therefore not shared by all keys in this group, so we drop this trait from the group and hence drop id 4. On the contrary, ids 1 and 5 have the same traits but different keys, and together they represent all the keys in the group ((1, 1) and (1, 2)), so the traits of ids 1 and 5 are kept.
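
To make this counting concrete, here is an illustrative check on the example data (not part of the solution). It follows the explanation above, which treats the key as the (key1, key2) pair:

# distinct key pairs available in each group: group (1,1) has two, (1,1) and (1,2)
dt[, .(n_keys = uniqueN(.SD)), by = .(group_id1, group_id2), .SDcols = c("key1", "key2")]

# distinct key pairs covering each trait within each group: trait (1,1) in
# group (1,1) is covered by only one key pair, so its row (id 4) is dropped
dt[, .(n_keys = uniqueN(.SD)), by = .(group_id1, group_id2, trait_id1, trait_id2),
   .SDcols = c("key1", "key2")]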

A function to achieve this is given here:

intersect_this_by2 <- function(dt,
                               key_by = NULL,
                               split_by = NULL,
                               intersect_by = NULL){

    dtc <- as.data.table(dt)       

    # compute number of keys in the group
    dtc[, n_keys := uniqueN(.SD), by = split_by, .SDcols = key_by]
    # compute number of keys represented by each trait in each group 
    # and keep row only if they represent all keys from the group
    dtc[, keep := n_keys == uniqueN(.SD), by = c(intersect_by, split_by), .SDcols = key_by]
    dtc <- dtc[keep == TRUE][, c("n_keys", "keep") := NULL]
    return(dtc)      
}
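
A quick usage sketch on the example data; note that both key columns are passed here, which is an assumption matching the expected res above (with key_by = c("key1") alone, every group has a single key value and nothing would be dropped):

res2 <- intersect_this_by2(dt,
                           key_by = c("key1", "key2"),
                           split_by = c("group_id1", "group_id2"),
                           intersect_by = c("trait_id1", "trait_id2"))
res2[]  # id 4 is dropped; the remaining five rows match res above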

But it gets quite slow for big datasets or complex traits/keys/groups... The real data.table has 10 million rows and the traits have 30 levels... Is there any way to improve it? Any obvious pitfalls? Thanks for the help.

FINAL EDIT: Uwe proposed a concise solution which is 40% faster than my initial code (which I deleted here because it was confusing). The final function looks like this:

intersect_this_by_uwe <- function(dt,
                                  key_by = c("key1"),
                                  split_by = c("group_id1", "group_id2"),
                                  intersect_by = c("trait_id1", "trait_id2")){
    dti <- copy(dt)
    dti[, original_order_id__ := 1:.N]
    setkeyv(dti, c(split_by, intersect_by, key_by))
    uni <- unique(dti, by = c(split_by, intersect_by, key_by))
    unique_keys_by_group <-
        unique(uni, by = c(split_by, key_by))[, .N, by = c(split_by)]
    unique_keys_by_group_and_trait <-
        uni[, .N, by = c(split_by, intersect_by)]
    # 1st join to pick group/traits combinations with equal number of unique keys
    selected_groups_and_traits <-
        unique_keys_by_group_and_trait[unique_keys_by_group,
                                       on = c(split_by, "N"), nomatch = 0L]
    # 2nd join to pick records of valid subsets
    dti[selected_groups_and_traits, on = c(split_by, intersect_by)][
        order(original_order_id__), -c("original_order_id__","N")]
}
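
A quick check on the example data (same assumption about the key columns as above):

res_uwe <- intersect_this_by_uwe(dt,
                                 key_by = c("key1", "key2"),
                                 split_by = c("group_id1", "group_id2"),
                                 intersect_by = c("trait_id1", "trait_id2"))
res_uwe[]  # same five rows, returned in the input row order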

And for the record, here are the benchmarks on the 10M rows dataset:

> microbenchmark::microbenchmark(old_way = {res <- intersect_this_by(dt,
+                                                                    key_by = c("key1"),
+                                                                    split_by = c("group_id1", "group_id2"),
+                                                                    intersect_by = c("trait_id1", "trait_id2"))},
+                                new_way = {res <- intersect_this_by2(dt,
+                                                                     key_by = c("key1"),
+                                                                     split_by = c("group_id1", "group_id2"),
+                                                                     intersect_by = c("trait_id1", "trait_id2"))},
+                                new_way_uwe = {res <- intersect_this_by_uwe(dt,
+                                                                            key_by = c("key1"),
+                                                                            split_by = c("group_id1", "group_id2"),
+                                                                            intersect_by = c("trait_id1", "trait_id2"))},
+                                times = 10)
Unit: seconds
        expr       min        lq      mean    median        uq       max neval cld
     old_way  3.145468  3.530898  3.514020  3.544661  3.577814  3.623707    10  b 
     new_way 15.670487 15.792249 15.948385 15.988003 16.097436 16.206044    10   c
 new_way_uwe  1.982503  2.350001  2.320591  2.394206  2.412751  2.436381    10 a  


Recommended answer

With the additional explanations by the OP, I believe I have gained a better understanding of the problem.

The OP wants to remove incomplete subsets from his dataset. Each group_id1, group_id2 group contains a set of unique key1 values. A complete subset contains at least one group_id1, group_id2, trait_id1, trait_id2, key1 record for each of the key1 values in the group_id1, group_id2 group.

It is not necessary to check the key1 values when comparing the grouping on the group_id1, group_id2, trait_id1, trait_id2 level with the group_id1, group_id2 level. It is sufficient to check if the number of distinct key1 values is equal.

So, the solution below follows the general outline of OP's own answer (https://stackoverflow.com/a/50601189/3817004) but uses two joins to achieve the result:

setkey(dt, group_id1, group_id2, trait_id1, trait_id2, key1)
uni <- unique(dt, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))
unique_keys_by_group <- 
  unique(uni, by = c("group_id1", "group_id2", "key1"))[, .N, by = .(group_id1, group_id2)]
unique_keys_by_group_and_trait <- 
  uni[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2)]
# 1st join to pick group/traits combinations with equal number of unique keys
selected_groups_and_traits <- 
  unique_keys_by_group_and_trait[unique_keys_by_group, 
                                 on = .(group_id1, group_id2, N), nomatch = 0L]
# 2nd join to pick records of valid subsets
res <- dt[selected_groups_and_traits, on = .(group_id1, group_id2, trait_id1, trait_id2)][
  order(id), -"N"]

It can be verified that the result is identical to OP's result:

identical(
  intersect_this_by(dt,
                    key_by = c("key1"),
                    split_by = c("group_id1", "group_id2"),
                    intersect_by = c("trait_id1", "trait_id2")),
  res)



[1] TRUE


Note that the uniqueN() function is not used due to performance issues as shown in the benchmarks of my first (wrong) answer.

OP's benchmark data is used (10 M rows).
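
The exact generation code was not included in this post; a minimal sketch of data with a comparable shape could look like this (the five-level key and ten-level groups are assumptions; only the row count and the 30 trait levels come from the question):

library(data.table)
set.seed(1)
n <- 1e7  # 10 M rows, as in the OP's benchmark
dt <- data.table(
  id        = 1:n,
  key1      = sample(5L, n, replace = TRUE),   # assumed number of key levels
  group_id1 = sample(10L, n, replace = TRUE),  # assumed number of group levels
  group_id2 = sample(10L, n, replace = TRUE),
  trait_id1 = sample(30L, n, replace = TRUE),  # "the traits have 30 levels"
  trait_id2 = sample(30L, n, replace = TRUE)
)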

library(microbenchmark)
mb <- microbenchmark(
  old_way = {
    DT <- copy(dt)
    res <- intersect_this_by(DT,
                             key_by = c("key1"),
                             split_by = c("group_id1", "group_id2"),
                             intersect_by = c("trait_id1", "trait_id2"))
  },
  uwe = {
    DT <- copy(dt)
    setkey(DT, group_id1, group_id2, trait_id1, trait_id2, key1)
    uni <- 
      unique(DT, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))
    unique_keys_by_group <- 
      unique(uni, by = c("group_id1", "group_id2", "key1"))[
        , .N, by = .(group_id1, group_id2)]
    unique_keys_by_group_and_trait <- 
      uni[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2)]
    selected_groups_and_traits <- 
      unique_keys_by_group_and_trait[unique_keys_by_group, 
                                     on = .(group_id1, group_id2, N), nomatch = 0L]
    res <- DT[selected_groups_and_traits, 
              on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id), -"N"]
  },
  times = 3L)
mb

The solution presented here is 40% faster:


Unit: seconds
    expr      min       lq     mean   median       uq      max neval cld
 old_way 7.251277 7.315796 7.350636 7.380316 7.400315 7.420315     3   b
     uwe 4.379781 4.461368 4.546267 4.542955 4.629510 4.716065     3  a

EDIT: Further performance improvements

The OP has asked for ideas to further improve performance.

I have already tried different approaches, including a double nested grouping (using the slow uniqueN() just for a simplified display of the code):

res <- DT[, {
  nuk_g = uniqueN(key1) 
  .SD[, if(nuk_g == uniqueN(key1)) .SD, by = .(trait_id1, trait_id2)]
}, by = .(group_id1, group_id2)][order(id)]

but all of them were slower for the given benchmark data.

It is likely that the performance of a particular method does not depend solely on the problem size, i.e., the number of rows, but also on the problem structure, e.g., the number of different groups, traits, and keys, as well as on the data types, etc.

So, without knowing the structure of your production data and the context of your computational flow, I do not think it is worthwhile to spend more time on benchmarking.

Anyway, there is one suggestion: make sure that setkey() is called only once, as it is rather costly (about 2 seconds) but speeds up all subsequent operations. (Verify with options(datatable.verbose = TRUE).)
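
For example, a minimal sketch:

options(datatable.verbose = TRUE)  # data.table then reports whether it can use the existing key
setkey(DT, group_id1, group_id2, trait_id1, trait_id2, key1)  # pay the sort cost once, upfront
key(DT)  # confirm the key is still in place before running the joins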
