Split data.table into roughly equal parts
Question
To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, `id`. Suppose:

- N is the length of the data
- k is the number of distinct values of id
- M is the number of desired parts

The idea is that M << k << N, so splitting by id is no good.

```r
library(data.table)
library(dplyr)
set.seed(1)

N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
  arrange(id)

t(dt$id)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a"  "b"  "b"  "b"  "b"  "c"  "c"  "c"  "d"  "d"   "d"   "e"   "e"   "f"   "f"   "f"
```
In this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}} and for M=4 is {{a,b}, {c}, {d,e}, {f}}.

More generally, if id were numeric, the cutoff points should be

```r
quantile(id, probs = seq(0, 1, length.out = M + 1), type = 1)
```

or some similar split into roughly equal parts. What is an efficient way to do this?
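As a small illustration of the quantile-based cutoffs mentioned above, here is a sketch on made-up numeric ids (the vector below is invented for illustration; it mirrors the group sizes of the letter example):

```r
# Hypothetical numeric ids with the same group sizes as the letter example
# (1 "a", 4 "b"s, 3 "c"s, 3 "d"s, 2 "e"s, 3 "f"s)
id <- rep(1:6, times = c(1, 4, 3, 3, 2, 3))
M  <- 3

# type = 1 is the inverse empirical CDF, so every cutoff is an observed id value
cuts <- quantile(id, probs = seq(0, 1, length.out = M + 1), type = 1)
cuts
```

Each consecutive pair of cutoffs then bounds one of the M roughly equal parts.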
Solution

Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.

I don't know how familiar you are with data.table, but you may have overlooked its `by` argument...? Quoting @eddi's comment from below:

Instead of literally splitting up the data - create a new "parallel.id" column, and then call

```r
dt[, parallel_operation(.SD), by = parallel.id]
```
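To make that suggestion concrete, here is a minimal sketch of the pattern. The bucket assignment and the per-bucket summary are placeholders I made up for illustration; `parallel_operation` in the comment above is not a real function, so a simple sum stands in for it:

```r
library(data.table)

set.seed(1)
N  <- 16
k  <- 6
M  <- 3
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N))

# Assign each distinct id to one of M buckets; this round-robin over the
# factor levels is a made-up assignment, not part of data.table
dt[, parallel.id := as.integer(factor(id)) %% M]

# Each by-group can then be processed independently, e.g. summed:
res <- dt[, .(total = sum(value), rows = .N), by = parallel.id]
```

The point is that no physical split of `dt` is needed: `by` hands each bucket to the operation in turn, and a parallel backend can do the same per group.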
Answer, assuming you don't want to use `by`

Sort the IDs by size:

```r
ids <- names(sort(table(dt$id)))
n   <- length(ids)
```
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:

```r
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
```
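To see what the interleaving line does, here is the same expression applied to a small hand-made vector of ids, already sorted by group size, smallest first (the letters are placeholders, not the OP's data):

```r
ids <- c("a", "b", "c", "d")  # pretend these are sorted by group size, ascending
n   <- length(ids)

# c(ids, rev(ids)) lists the ids forwards then backwards; order(c(1:n, 1:n))
# zips the two copies together, pairing the smallest with the largest,
# the 2nd smallest with the 2nd largest, and so on; [1:n] keeps one copy of each
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
alt_ids
# "a" "d" "b" "c"  -- alternates small, big, small, big
```

Consecutive ids in `alt_ids` therefore balance each other out, so chunking it into M runs gives groups of roughly equal total size.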
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):

```r
gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]

# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]], ]
```
Check that the sizes aren't too bad:

```r
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9        for M = 2
# [1] 5 5 6      for M = 3
# [1] 1 6 3 6    for M = 4
# [1] 1 4 2 3 6  for M = 5
```
Although I emphasized data.table at the top, this should work fine with a data.frame
, too.