R - split large dataframe into list in parallel


Problem Description

I have a large transaction dataset (around 5 million rows), and I need to split all transactions by ID (around 1 million unique IDs). The expected result is each unique ID with its items in a list.

I did try the most simple and direct way to split the transaction dataset (by referring to Why is split inefficient on large data frames with many groups?). I know that converting the dataframe into a data.table might be more efficient.

Sample source df

set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE) 
                )

Convert ID to character and Item to factor

x$ID <- as.character(x$ID)
x$Item <- as.factor(x$Item)

Convert df to dt, then split dt into a list

library(data.table)
x <- as.data.table(x)
system.time(
  xx <- split(x$Item, x$ID)
)

Expected result, as a list

head(xx, 2)
#$A100
#[1] tea    orange
#Levels: apple lemon orange rice tea

#$A101
#[1] rice
#Levels: apple lemon orange rice tea
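
As an aside: split(x$Item, x$ID) operates on the two columns as plain vectors, so the as.data.table() conversion by itself changes nothing here. Below is a minimal sketch of the grouping route that data.table offers instead, assuming a list column named Items is an acceptable result shape:

library(data.table)
x <- as.data.table(x)

# Collect each ID's items into a list column in one grouped pass
xx_dt <- x[, .(Items = list(Item)), by = ID]

# If a named list like split()'s output is needed:
xx_list <- setNames(xx_dt$Items, xx_dt$ID)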

Problem: After running for 2 hours on my machine (4 cores, 16 GB RAM, Win10, R 3.4.3), it was still running and never completed. I checked the CPU usage while it was running; it consumed only 35-40%.

My idea:

I'm thinking: is there any way to fully utilize the computational power of my machine (run the "split" in parallel), using only detectCores() - 1 = 3 cores?

1st: Split the large transaction dataset by ID into 3 smaller partitions (smaller datasets).

2nd: Use a foreach loop to run split on the 3 partitions (smaller datasets) into lists in parallel, then append (row bind) each list on every iteration until the end.

Question: Is my idea practical? I did read about mclapply and its mc.cores argument, but it seems mc.cores = 1 is the only option on Windows, so it won't help in my case. Is there any better and more efficient way to do the split for a large dataset? Any comment is welcome, thanks!
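
For reference, the partition idea can be sketched with a PSOCK cluster from the parallel package; socket clusters do work on Windows, unlike mc.cores > 1. This is only a sketch of the idea, with each worker still running base::split on its own chunk, so it is not guaranteed to beat a single-threaded approach:

library(parallel)

cl <- makeCluster(detectCores() - 1)  # socket cluster, works on Windows

# Hash each ID to exactly one chunk so no ID is split across workers
chunk <- (as.integer(factor(x$ID)) %% length(cl)) + 1
parts <- split(seq_len(nrow(x)), chunk)  # only a few groups, so this is cheap

# Each worker splits its own partition; results are disjoint by construction
res <- parLapply(cl, parts,
                 function(idx, Item, ID) split(Item[idx], ID[idx]),
                 Item = x$Item, ID = x$ID)
stopCluster(cl)

# Combine the per-worker lists (no ID appears twice, so c() is enough)
xx <- do.call(c, unname(res))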

Answer

Surprisingly and interestingly, consider by (the object-oriented wrapper to tapply), which operates similarly to split on data frames, with the added feature of passing each split into a function call. The equivalent of split would be to return the argument unchanged, i.e. to call identity:

by(x$Item, x$ID, function(x) x)

by(x$Item, x$ID, identity)

Do note, the return value of by is a by class object, which is essentially a list with additional attributes.
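
If a plain named list is needed downstream, those extra attributes can be dropped; a small illustrative snippet (the element contents are unchanged):

xx2 <- by(x$Item, x$ID, identity)

# Strip the "by" class and array attributes, keeping the level names
xx2_list <- setNames(lapply(seq_along(xx2), function(i) xx2[[i]]), names(xx2))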

Using your random data frame example, base::split did not finish after 1 hour, but base::by finished well under 5 minutes on my machine with 64 GB RAM! Usually I would have assumed by carries more overhead, being a sibling of the apply family, but my opinion may soon change.

50,000-row example

set.seed(123)
n = 50000 #number of sample data (50k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE) 
)

system.time( xx <- split(x$Item, x$ID) )
#   user  system elapsed 
#  20.09    0.00   20.09 

system.time( xx2 <- by(x$Item, x$ID, identity) )
#   user  system elapsed 
#   1.55    0.00    1.55 

all.equal(unlist(xx), unlist(xx2))
# [1] TRUE

identical(unlist(xx), unlist(xx2))
# [1] TRUE

500,000-row example

set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""), 
                Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE) 
)

system.time( xx <- split(x$Item, x$ID) )
# DID NOT FINISH AFTER 1 HOUR

system.time( xx2 <- by(x$Item, x$ID, identity) )
#   user  system elapsed 
#  23.00    0.06   23.09 


Source code reveals split.default might run more processing at the R level (unlike C or Fortran), with a for loop across the factor levels:

getAnywhere(split.default)

function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...) 
{
    if (!missing(...)) 
        .NotYetUsed(deparse(...), error = FALSE)
    if (is.list(f)) 
        f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
    else if (!is.factor(f)) 
        f <- as.factor(f)
    else if (drop) 
        f <- factor(f)
    storage.mode(f) <- "integer"
    if (is.null(attr(x, "class"))) 
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    ind <- .Internal(split(seq_along(x), f))
    for (k in lf) y[[k]] <- x[ind[[k]]]
    y
}

Conversely, the source code for by.data.frame reveals a call to tapply, which itself is a wrapper to lapply:

getAnywhere(by.data.frame)

function (data, INDICES, FUN, ..., simplify = TRUE) 
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    structure(eval(substitute(tapply(seq_len(nd), IND, FUNx, 
        simplify = simplify)), data), call = match.call(), class = "by")
}
