mclapply with big objects - "serialization is too large to store in a raw vector"


Question

I keep hitting an issue with the multicore package and big objects. The basic idea is that I'm using a Bioconductor function (readBamGappedAlignments) to read in large objects. I have a character vector of filenames, and I've been using mclapply to loop over the files and read them into a list. The function looks something like this:

objects <- mclapply(files, function(x) {
  on.exit(message(sprintf("Completed: %s", x)))
  message(sprintf("Started: '%s'", x))
  readBamGappedAlignments(x)
}, mc.cores=10)

However, I keep getting the following error: Error: serialization is too large to store in a raw vector. Yet I can read the same files in individually without this error. I've found mention of this issue here, but without resolution.

Any suggestions for a parallel solution would be appreciated - this has to be done in parallel. I could look towards snow, but I have a very powerful server with 15 processors (8 cores each) and 256GB of memory that I can do this on. I'd rather just do it on this machine across cores than use one of our clusters.

Answer

The integer limit is rumored to be addressed very soon in R. In my experience, that limit can block datasets with under 2 billion cells (around the maximum integer), because low-level functions like sendMaster in the multicore package rely on passing raw vectors. I had around 1 million processes representing about 400 million rows of data and 800 million cells in data.table format, and when mclapply was sending the results back it ran into this limit.
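The ceiling comes from R's 32-bit integer indexing of raw vectors: a serialized result larger than .Machine$integer.max bytes cannot be passed back from a worker. A rough way to see the limit and check a result against it (the one_result object below is hypothetical, and object.size is only an in-memory proxy for the serialized size):

.Machine$integer.max                             # 2147483647 bytes, the maximum length of a raw vector
object.size(one_result) > .Machine$integer.max   # rough check of whether a result could fit when serialized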

A divide-and-conquer strategy is not that hard, and it works. I realize this is a hack, and one should be able to rely on mclapply.

Instead of one big list, create a list of lists. Each sub-list is smaller than the single list that broke, and you then feed them into mclapply one split at a time. Call this file_map. The results form a list of lists, which you can flatten with a double-concatenate do.call. This way, each time mclapply finishes, the serialized raw vector stays a manageable size.
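One way to build file_map, for example, is simply to split the character vector of filenames into fixed-size chunks; the chunk size of 20 below is an arbitrary illustration, not part of the original answer:

# Split the vector of BAM filenames into chunks of 20 files each;
# each chunk becomes one element of file_map and is handled by one mclapply call.
chunk_size <- 20
file_map <- split(files, ceiling(seq_along(files) / chunk_size))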

Just loop over the smaller pieces:

collector <- vector("list", length(file_map)) # pre-allocated; more complex than normal, for speed

for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    readBamGappedAlignments(x)
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}

output <- do.call("c", do.call("c", collector)) # double concatenate of the list of lists

Alternatively, save the output to a database, such as SQLite, as you go.
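A minimal sketch of that approach, assuming each result can be coerced to a data frame with as.data.frame and written through DBI/RSQLite (the database file, table name, and coercion are assumptions, not part of the original answer):

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "alignments.db")

for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], readBamGappedAlignments, mc.cores = 10)
  # Append each chunk's results to one table as we go, so nothing large
  # accumulates in memory between chunks (illustrative only).
  for (aln in reduced_set) {
    dbWriteTable(con, "alignments", as.data.frame(aln), append = TRUE)
  }
}

dbDisconnect(con)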
