Split data.table into roughly equal parts
Question
To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, `id`. Suppose:

- N is the length of the data
- k is the number of distinct values of id
- M is the number of desired parts

The idea is that M << k << N, so splitting by id is no good.

```r
library(data.table)
library(dplyr)
set.seed(1)

N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
  arrange(id)

t(dt$id)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a"  "b"  "b"  "b"  "b"  "c"  "c"  "c"  "d"  "d"   "d"   "e"   "e"   "f"   "f"   "f"
```
In this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}} and for M=4 is {{a,b}, {c}, {d,e}, {f}}.

More generally, if id were numeric, the cutoff points should be

```r
quantile(id, probs = seq(0, 1, length.out = M + 1), type = 1)
```

or some similar split into roughly equal parts. What is an efficient way to do this?
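As a small illustration of the quantile-based cutoffs mentioned above, here is a sketch on made-up numeric ids (the vector below is invented for illustration; it mirrors the group sizes of the letter example):

```r
# Hypothetical numeric ids with the same group sizes as the letter example
# (1 "a", 4 "b"s, 3 "c"s, 3 "d"s, 2 "e"s, 3 "f"s)
id <- rep(1:6, times = c(1, 4, 3, 3, 2, 3))
M  <- 3

# type = 1 is the inverse empirical CDF, so every cutoff is an observed id value
cuts <- quantile(id, probs = seq(0, 1, length.out = M + 1), type = 1)
cuts
```

Each consecutive pair of cutoffs then bounds one of the M roughly equal parts.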
Solution

Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.

I don't know how familiar you are with data.table, but you may have overlooked its `by` argument...? Quoting @eddi's comment from below:

Instead of literally splitting up the data - create a new "parallel.id" column, and then call

```r
dt[, parallel_operation(.SD), by = parallel.id]
```
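To make that suggestion concrete, here is a minimal sketch of the pattern. The bucket assignment and the per-bucket summary are placeholders I made up for illustration; `parallel_operation` in the comment above is not a real function, so a simple sum stands in for it:

```r
library(data.table)

set.seed(1)
N  <- 16
k  <- 6
M  <- 3
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N))

# Assign each distinct id to one of M buckets; this round-robin over the
# factor levels is a made-up assignment, not part of data.table
dt[, parallel.id := as.integer(factor(id)) %% M]

# Each by-group can then be processed independently, e.g. summed:
res <- dt[, .(total = sum(value), rows = .N), by = parallel.id]
```

The point is that no physical split of `dt` is needed: `by` hands each bucket to the operation in turn, and a parallel backend can do the same per group.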
Answer, assuming you don't want to use `by`

Sort the IDs by size:

```r
ids <- names(sort(table(dt$id)))
n   <- length(ids)
```
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:

```r
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
```
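To see what the interleaving line does, here is the same expression applied to a small hand-made vector of ids, already sorted by group size, smallest first (the letters are placeholders, not the OP's data):

```r
ids <- c("a", "b", "c", "d")  # pretend these are sorted by group size, ascending
n   <- length(ids)

# c(ids, rev(ids)) lists the ids forwards then backwards; order(c(1:n, 1:n))
# zips the two copies together, pairing the smallest with the largest,
# the 2nd smallest with the 2nd largest, and so on; [1:n] keeps one copy of each
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
alt_ids
# "a" "d" "b" "c"  -- alternates small, big, small, big
```

Consecutive ids in `alt_ids` therefore balance each other out, so chunking it into M runs gives groups of roughly equal total size.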
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):

```r
gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]

# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]], ]
```
Check that the sizes aren't too bad:

```r
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9        for M = 2
# [1] 5 5 6      for M = 3
# [1] 1 6 3 6    for M = 4
# [1] 1 4 2 3 6  for M = 5
```
Although I emphasized data.table at the top, this should work fine with a data.frame
, too.