Reading in chunks at a time using fread in package data.table


Problem Description

I'm trying to input a large tab-delimited file (around 2GB) using the fread function in the data.table package. However, because it's so large, it doesn't fit completely in memory. I tried to input it in chunks using the skip and nrows arguments, such as:

chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
    # fread's argument is 'nrows', not 'nrow'; reading a full chunk.size
    # rows (rather than chunk.size-1) avoids silently dropping one row
    # per chunk
    temp = fread("myfile.txt", skip=(chunk-1)*chunk.size, nrows=chunk.size)
    #do something to temp
    chunk = chunk + 1
    if(nrow(temp) < chunk.size) done = TRUE
}

In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, and then getting the next million, etc. The problem with this code is that after every chunk is retrieved, fread needs to start scanning the file from the very beginning, since skip increases by a million after every loop iteration. As a result, fread takes longer and longer to actually reach each successive chunk, making this very inefficient.

Is there a way to tell fread to pause every, say, 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?

Recommended Answer

You should use the LaF package. This introduces a sort of pointer on your data, thus avoiding the (for very large data) annoying behaviour of reading the whole file. As far as I understand it, fread() in the data.table package needs to know the total number of rows, which takes time for GB-sized data. Using the pointer in LaF you can go to any line you want, read a chunk of data that you can apply your function to, and then move on to the next chunk. On my small PC I ran through a 25 GB csv file in steps of 10e6 lines and extracted the roughly 5e6 observations I needed; each 10e6 chunk took 30 seconds.

Update:

library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'

#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)

Then create a connection to your file using the model:

df.laf <- laf_open(model)

Once done, you can do all sorts of things without needing to know the size of the file, as you would with data.table. For instance, place the pointer at line 100e6 and read 1e6 lines of data from there:

goto(df.laf, 100e6)                      # move the pointer to line 100e6
data <- next_block(df.laf, nrows=1e6)    # read the next 1e6 lines from there

Now data contains 1e6 lines of your CSV file (starting from line 100e6). The pointer has advanced past those lines, so a further next_block() call would continue from where this one stopped.
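
If you do eventually need the total number of lines, LaF can also count them for you; a minimal sketch (determine_nlines() counts lines in one fast pass, without parsing the fields):

n_lines <- determine_nlines(huge_file)   # total number of lines in the file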

You can read in chunks of data (size depending on your memory) and only keep what you need. For example, the huge_file in my example points to a file with all known protein sequences and has a size of >27 GB, way too big for my PC. To get only the human sequences I filtered on the organism id, which is 9606 for human and should appear at the start of the variable protein1. A dirty way is to put it into a simple for-loop and just read one data chunk at a time:

library('dplyr')
library('stringr')

# empty data frame with the same columns as the file, to collect results
res <- df.laf[1,][0,]
begin(df.laf)   # reset the pointer to the start (the goto() above moved it)
for(i in 1:10){
  raw <-
    next_block(df.laf, nrows=100e6) %>%
    filter(str_detect(protein1, "^9606\\."))   # "\\." escapes the literal dot
  res <- rbind(res, raw)
}

Now res contains the filtered human data. But better, and for more complex operations such as calculating on the data on the fly, the function process_blocks() takes a function as an argument; inside that function you can do whatever you want with each piece of data. Read the documentation.
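
A minimal sketch of that pattern (building on the protein1 example above; count_human is a hypothetical name): process_blocks() calls your function once per block, passing the block and the result of the previous call, and signals the end of the file with a zero-row block:

# count the human rows block by block; 'result' is NULL on the first call
count_human <- function(block, result) {
  if (is.null(result)) result <- 0
  if (nrow(block) == 0) return(result)   # final call: return the running total
  result + sum(grepl("^9606\\.", block$protein1))
}

n_human <- process_blocks(df.laf, count_human, nrows = 1e6)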
