R - split large dataframe into list in parallel
Question
I have a large transaction dataset (around 5 million rows) and need to split all transactions by ID (around 1 million unique IDs). The expected result is a list with one element per unique ID, holding that ID's items.
I tried the simplest and most direct way to split the transaction dataset (see "Why is split inefficient on large data frames with many groups?"). I understand that converting the data frame to a data.table might be more efficient.
Sample source df
set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE)
)
Convert ID to character and Item to factor
x$ID <- as.character(x$ID)
x$Item <- as.factor(x$Item)
Convert df to dt, then split into a list
library(data.table)
x <- as.data.table(x)
system.time(
xx <- split(x$Item, x$ID)
)
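Note that the call above still goes through base split on two plain columns, so the conversion to data.table does not change how the grouping is done. A minimal sketch of a grouping performed by data.table itself, assuming the data.table package is installed and using a smaller n than the question for a quick trial. The names dt, grouped and xx_dt are illustrative, and no timing is claimed here:

```r
library(data.table)

set.seed(123)
n <- 30000  # smaller than the question's sample, for a quick trial
x <- data.frame(
  ID   = paste0(LETTERS[1:8], sample(1:round(n / 3), n, replace = TRUE)),
  Item = factor(sample(c("apple", "orange", "lemon", "tea", "rice"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

dt <- as.data.table(x)

# One row per ID; Item becomes a list column holding that ID's items.
grouped <- dt[, .(Item = list(Item)), by = ID]

# Turn the list column into a named list, similar in shape to split()'s result.
# Groups appear in order of first occurrence rather than sorted by ID.
xx_dt <- setNames(grouped$Item, grouped$ID)
```

If the named list is only an intermediate step, keeping the grouped table (one row per ID) may be enough for later lookups.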
Expected result (list)
head(xx, 2)
#$A100
#[1] tea orange
#Levels: apple lemon orange rice tea
#$A101
#[1] rice
#Levels: apple lemon orange rice tea
Problem: On my machine (4 cores, 16 GB RAM, Windows 10, R 3.4.3) the split was still running after 2 hours and never completed. While it ran I checked CPU usage, and it used only 35-40% of the CPU.
My idea:
I am wondering whether there is a way to make full use of my machine's computing power by running the split in parallel on detectCores() - 1 = 3 cores.
1st: Split the large transaction dataset by ID into 3 smaller partitions.
2nd: Use a foreach loop to split the 3 partitions into lists in parallel, then append each iteration's list to the combined result.
Question: Is my idea practical? I read about mclapply and its mc.cores argument, but mc.cores = 1 seems to be the only option on Windows, so it won't help in my case. Is there a better, more efficient way to split a large dataset? Any comments are welcome, thanks!
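The two steps of the idea can be sketched with the parallel package, whose socket clusters also work on Windows. This is only an illustration under assumptions: parLapply stands in for the foreach loop, n_workers = 2 stands in for detectCores() - 1, the sample is smaller than the question's, and the names part, chunks, pieces and xx_par are made up for the example. Whether it beats a serial split is not measured here, since each chunk has to be copied to a worker first.

```r
library(parallel)

set.seed(123)
n <- 30000  # smaller than the question's sample, for a quick trial
x <- data.frame(
  ID   = paste0(LETTERS[1:8], sample(1:round(n / 3), n, replace = TRUE)),
  Item = factor(sample(c("apple", "orange", "lemon", "tea", "rice"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

n_workers <- 2  # assumption; the question proposes detectCores() - 1

# 1st step: assign every unique ID to exactly one partition,
# so that no ID is spread across two chunks.
ids    <- unique(x$ID)
part   <- (match(x$ID, ids) - 1L) %% n_workers + 1L
chunks <- split(x, part)  # only n_workers groups, so this split is cheap

# 2nd step: split each partition by ID on its own worker process.
cl <- makeCluster(n_workers)
pieces <- tryCatch(
  parLapply(cl, chunks, function(d) split(d$Item, d$ID)),
  finally = stopCluster(cl)
)

# Concatenate the per-partition lists into one list named by ID.
xx_par <- do.call(c, unname(pieces))
```

Because each ID lives in exactly one partition, the per-partition lists can simply be concatenated; no merging of elements is needed.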
Recommended answer
Surprisingly and interestingly, consider by (the object-oriented wrapper to tapply), which operates on data frames much like split, with the added feature of passing each split to a function call. The equivalent of split is to return the argument unchanged or to call identity.
by(x$Item, x$ID, function(x) x)
by(x$Item, x$ID, identity)
Note that by returns an object of class by, which is essentially a list with additional attributes.
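If a plain named list is needed afterwards, those attributes can be stripped. A minimal sketch on a smaller sample; arr and plain are illustrative names, and the conversion shown is just one possible way:

```r
set.seed(123)
n <- 30000  # smaller sample for a quick trial
x <- data.frame(
  ID   = paste0(LETTERS[1:8], sample(1:round(n / 3), n, replace = TRUE)),
  Item = factor(sample(c("apple", "orange", "lemon", "tea", "rice"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

xx2 <- by(x$Item, x$ID, identity)

class(xx2)              # "by"
names(attributes(xx2))  # inspect the extra attributes carried by the object

# Drop the class, then copy the elements into an ordinary named list.
arr   <- unclass(xx2)
plain <- setNames(lapply(seq_along(arr), function(i) arr[[i]]), dimnames(arr)[[1]])
```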
Using your random data frame example, base::split did not finish after 1 hour, but base::by finished in well under 5 minutes on my machine with 64 GB of RAM! I had assumed by would carry more overhead, being a sibling of the apply family, but my opinion may soon change.
50K row example
set.seed(123)
n = 50000 #number of sample data (50k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE)
)
system.time( xx <- split(x$Item, x$ID) )
# user system elapsed
# 20.09 0.00 20.09
system.time( xx2 <- by(x$Item, x$ID, identity) )
# user system elapsed
# 1.55 0.00 1.55
all.equal(unlist(xx), unlist(xx2))
# [1] TRUE
identical(unlist(xx), unlist(xx2))
# [1] TRUE
500K row example
set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE)
)
system.time( xx <- split(x$Item, x$ID) )
# DID NOT FINISH AFTER 1 HOUR
system.time( xx2 <- by(x$Item, x$ID, identity) )
# user system elapsed
# 23.00 0.06 23.09
The source code reveals that split.default may do more of its work at the R level (rather than in C or Fortran), with a for loop across the factor levels. Because x$Item is a factor and so carries a class attribute, the call skips the early .Internal return and reaches that loop:
getAnywhere(split.default)
function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...)
{
    if (!missing(...))
        .NotYetUsed(deparse(...), error = FALSE)
    if (is.list(f))
        f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
    else if (!is.factor(f))
        f <- as.factor(f)
    else if (drop)
        f <- factor(f)
    storage.mode(f) <- "integer"
    if (is.null(attr(x, "class")))
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    ind <- .Internal(split(seq_along(x), f))
    for (k in lf) y[[k]] <- x[ind[[k]]]
    y
}
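In the source above, the early return applies only when x has no class attribute, and a factor has one. A minimal sketch of how that branch could be reached, assuming it is acceptable to split a plain character copy of Item (or the row indices) instead of the factor itself. The names item_chr, xx_chr and rows_by_id are illustrative, and whether this is faster depends on the R version, which is not measured here:

```r
set.seed(123)
n <- 30000  # smaller sample for a quick trial
x <- data.frame(
  ID   = paste0(LETTERS[1:8], sample(1:round(n / 3), n, replace = TRUE)),
  Item = factor(sample(c("apple", "orange", "lemon", "tea", "rice"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

# A factor carries a class attribute, so in the source shown above
# it does not qualify for the early .Internal return.
stopifnot(!is.null(attr(x$Item, "class")))

# A plain character vector has no class attribute and does qualify.
item_chr <- as.character(x$Item)
xx_chr   <- split(item_chr, x$ID)

# Alternative: split the row indices once, then index the factor on demand.
rows_by_id <- split(seq_len(nrow(x)), x$ID)
first_id   <- names(rows_by_id)[1]
x$Item[rows_by_id[[first_id]]]  # items of the first ID, still a factor
```

The character variant loses the factor levels; the index variant keeps them because the original factor is subset only when an ID is looked up.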
Conversely, the source code for by.data.frame reveals a call to tapply, which is itself a wrapper around lapply:
getAnywhere(by.data.frame)
function (data, INDICES, FUN, ..., simplify = TRUE)
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    structure(eval(substitute(tapply(seq_len(nd), IND, FUNx,
        simplify = simplify)), data), call = match.call(), class = "by")
}
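The pattern in this source, tapply over the row indices with a function that subsets the data, can also be written directly. A minimal sketch on a smaller sample; xx_tapply is an illustrative name, and simplify = FALSE is chosen here so the result is always a list:

```r
set.seed(123)
n <- 30000  # smaller sample for a quick trial
x <- data.frame(
  ID   = paste0(LETTERS[1:8], sample(1:round(n / 3), n, replace = TRUE)),
  Item = factor(sample(c("apple", "orange", "lemon", "tea", "rice"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Group the row indices by ID, then pull each group's items from the factor.
xx_tapply <- tapply(seq_len(nrow(x)), x$ID, function(i) x$Item[i], simplify = FALSE)

# The result is a one-dimensional list array; group names are in its dimnames.
head(dimnames(xx_tapply)[[1]])
xx_tapply[[1]]
```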