Row limit for data.table in R using fread

Question

I wanted to know if there is a limit to the number of rows that can be read using the data.table fread function. I am working with a table with 4 billion rows, 4 columns, about 40 GB. It appears that fread will read only the first ~840 million rows. It does not give any errors but returns to the R prompt as if it had read all the data!

I understand that fread is not for "prod use" at the moment, and wanted to find out if there was any timeframe for implementation of a prod-release.

The reason I am using data.table is that, for files of such sizes, it is extremely efficient at processing the data compared to loading the file in a data.frame, etc.

At the moment, I am trying 2 other alternatives -

1) Using scan and passing on to a data.table

data.table(matrix(scan("file.csv",what="integer",sep=","),ncol=4))

Resulted in --
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  too many items

2) Breaking the file up into multiple individual segments with a limit of approx. 500 million rows using Unix split and reading them sequentially ... then looping over the files sequentially into fread - a bit cumbersome, but appears to be the only workable solution.
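
For illustration, a rough sketch of this split-and-read approach is below; the split command, the chunk_ file prefix, and the assumption that the file has no header row are mine, not taken from the question.

# In a shell, split the large CSV into pieces of roughly 500 million lines each:
#   split -l 500000000 file.csv chunk_
library(data.table)
chunk_files <- list.files(pattern = "^chunk_")        # the pieces produced by split (assumed prefix)
pieces <- lapply(chunk_files, fread, header = FALSE)  # read each piece sequentially with fread
full_dt <- rbindlist(pieces)                          # stack the pieces into one data.table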

I think there may be an Rcpp way to do this even faster, but am not sure how it is generally implemented.

Thanks in advance.

Answer

I was able to accomplish this using feedback from another posting on Stack Overflow. The process was very fast and 40 GB of data was read in about 10 minutes using fread iteratively. Foreach-dopar failed to work when run by itself to read the files into new data.tables sequentially, due to some limitations that are also mentioned on the page below.

Note: The file list (file_map) was prepared by simply running --

file_map <- list.files(pattern="test.$")  # Replace pattern to suit your requirement

> mclapply with big objects - "serialization is too large to store in a raw vector"

Quoting --

library(data.table)  # for fread() and rbindlist()
library(parallel)    # for mclapply()

collector <- vector("list", length(file_map)) # more complex than normal for speed

for (index in 1:length(file_map)) {
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    fread(x)             # <----- CHANGED THIS LINE to fread
  }, mc.cores=10)
  collector[[index]] <- reduced_set
}

# Additional lines (in place of rbind as in the URL above)

finalList <- data.table()  # start empty so the first rbindlist() call has something to bind to
for (i in 1:length(collector)) {
  finalList <- rbindlist(list(finalList, yourFunction(collector[[i]][[1]])))  # assign back so the result accumulates
}
# Replace yourFunction as needed; in my case it was an operation I performed on each segment, joining the results with rbindlist at the end.
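
If no per-segment processing were needed, the segments could presumably be stacked directly instead. A minimal sketch, relying on the fact that each element of collector is the one-element list returned by mclapply:

finalList <- rbindlist(lapply(collector, function(seg) seg[[1]]))  # seg[[1]] is the data.table read by fread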

My function included a loop using foreach-dopar that executed across several cores per file as specified in file_map. This allowed me to use dopar without encountering the "serialization too large" error when running on the combined file.
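
The answer does not show the body of that function, so the following is only a minimal sketch of the idea. The doParallel backend, the number of chunks and cores, and the column name V1 (fread's default for headerless files) are assumptions.

library(foreach)
library(doParallel)
library(data.table)

registerDoParallel(cores = 10)  # register a parallel backend so %dopar% runs across cores

# Hypothetical stand-in for yourFunction(): process one segment's rows in parallel chunks.
yourFunction <- function(dt) {
  chunk_idx <- split(1:nrow(dt), cut(1:nrow(dt), 10, labels = FALSE))
  pieces <- foreach(idx = chunk_idx, .packages = "data.table") %dopar% {
    dt[idx, .(total = sum(V1))]  # example per-chunk operation on the assumed column V1
  }
  rbindlist(pieces)              # join the per-core results, as described in the answer
}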

Another helpful post is at -- loading files in parallel not working with foreach + data.table
