Reading in chunks at a time using fread in package data.table


Problem description

I'm trying to read a large tab-delimited file (around 2 GB) using the fread function in the data.table package. However, because it's so large, it doesn't fit completely in memory. I tried to read it in chunks using the skip and nrows arguments, such as:

library(data.table)

chunk.size <- 1e6
done <- FALSE
chunk <- 1
while (!done) {
  # skip grows every iteration, so fread has to re-scan the file
  # from the start to find the beginning of the next chunk
  temp <- fread("myfile.txt", skip = (chunk - 1) * chunk.size, nrows = chunk.size)
  # do something to temp
  chunk <- chunk + 1
  if (nrow(temp) < chunk.size) done <- TRUE
}

In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, then getting the next million, and so on. The problem with this code is that after every chunk is retrieved, fread needs to start scanning the file from the very beginning, since skip increases by a million after every loop iteration. As a result, after every chunk, fread takes longer and longer to actually get to the next one, which makes this very inefficient.

Is there a way to tell fread to pause every, say, 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?

Answer

You should use the LaF package. This introduces a sort of pointer on your data, thus avoiding the (for very large data) annoying behaviour of reading the whole file. As far as I understand it, fread() in the data.table package needs to know the total number of rows, which takes time for GB-sized data. Using the pointer in LaF you can go to any line you want, read a chunk of data that you can apply your function to, and then move on to the next chunk. On my small PC I ran through a 25 GB csv file in steps of 10e6 lines and extracted the ~5e6 observations I needed in total; each 10e6-line chunk took 30 seconds.

Update:

library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'

#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)

Then create a connection to your file using the model:

df.laf <- laf_open(model)

Once done, you can do all sorts of things without needing to know the size of the file (as the data.table package does). For instance, place the pointer at line 100e6 and read 1e6 lines of data from there:

goto(df.laf, 100e6)
data <- next_block(df.laf, nrows = 1e6)

Now data contains 1e6 lines of your CSV file (starting from line 100e6).
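A quick sanity check on the block just read (nrow() and head() are base R, so nothing extra is assumed):

nrow(data)   # should be 1e6
head(data)   # columns as detected by the data model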

You can read in chunks of data (with the size depending on your memory) and only keep what you need. For example, the huge_file in my example points to a file with all known protein sequences and has a size of >27 GB, way too big for my PC. To get only the human sequences I filtered using the organism id, which is 9606 for human and should appear at the start of the variable protein1. A dirty way is to put it into a simple for-loop and just read one data chunk at a time:

library('dplyr')
library('stringr')

# empty data frame with the same columns as the file, used as an accumulator
res <- df.laf[1, ][0, ]

for (i in 1:10) {
  raw <- next_block(df.laf, nrows = 100e6) %>%
    filter(str_detect(protein1, "^9606\\."))
  res <- rbind(res, raw)
}
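As a side note: rbind() inside the loop copies the growing res on every iteration. A sketch of the same loop that collects the blocks in a list and binds them once at the end avoids that (bind_rows() comes from dplyr, which is already loaded):

chunks <- vector("list", 10)
for (i in 1:10) {
  chunks[[i]] <- next_block(df.laf, nrows = 100e6) %>%
    filter(str_detect(protein1, "^9606\\."))
}
res <- bind_rows(chunks)   # bind once instead of once per iteration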

Now res contains the filtered human data. Better still, for more complex operations such as calculating on the data on the fly, the function process_blocks() takes a function as an argument, so inside that function you can do whatever you want with each piece of data. Read the documentation.
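As an illustration, here is a minimal sketch of that approach, based on the LaF documentation: process_blocks() calls your function with each block and the result of the previous call, and signals the end with an empty block. It counts the human rows matched by the filter above without ever holding the whole file in memory:

# Count rows whose protein1 starts with "9606.", one block at a time.
count_human <- function(block, result) {
  if (is.null(result)) result <- 0      # first call: initialise the accumulator
  if (nrow(block) == 0) return(result)  # final, empty call: return the total
  result + sum(grepl("^9606\\.", block$protein1))
}

n_human <- process_blocks(df.laf, count_human, nrows = 1e6)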

