Reading a 40 GB csv file into R using bigmemory
Problem description
The title is pretty self-explanatory, but I will elaborate as follows. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at this problem. I am trying to solve it with the bigmemory package, but I have been running into difficulties.

Present constraints:

- Using a Linux server with 16 GB of RAM

Challenges:

So far, results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So I thought I would ask here to see if anyone has used it.

Any tips or advice on this line of attack? Or should I change to something else?

I apologize if this question is very similar to the previous one, but my scale of data is about 20 times bigger than in the previous questions.

Thanks!

Recommended answer

I don't know about bigmemory, but to satisfy your challenges you don't need to read the file in. Simply pipe some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out NULL lines and randomly select N lines, and then read that in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

    # Sample m = 100 lines at random from a file of n = 1,000,000 lines,
    # skipping every line that contains the literal string NULL.
    # `length` is a built-in awk function, so the line count is named n here.
    df <- read.csv(pipe("awk -F, 'BEGIN {srand(); m = 100; n = 1000000}
      !/NULL/ {if (rand() < m / (n - NR + 1)) {
        print; m--
        if (m == 0) exit
      }}' filename"),
      header = FALSE)  # the first sampled line is data, not column names

It wasn't obvious to me what you meant by NULL, so I took it literally, but it should be easy to modify it to fit your needs.
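The same filter-then-sample idea can be sketched in Python, one of the tools mentioned above. This is a minimal illustration and not the answer's own method: the function name `sample_lines`, its parameters, and the substring test for `NULL` are assumptions made for the example. It counts the eligible lines in a first pass, so it returns exactly the requested number of lines even when many lines are thrown out, and it never holds more than the sample in memory.

```python
import random


def sample_lines(path, m, skip_token="NULL", seed=None):
    """Return m lines chosen uniformly at random from the lines of `path`
    that do not contain `skip_token` (selection sampling in two passes)."""
    rng = random.Random(seed)

    # Pass 1: count eligible lines so the selection probabilities are exact.
    with open(path) as fh:
        remaining = sum(1 for line in fh if skip_token not in line)

    needed = min(m, remaining)  # cannot pick more lines than are eligible
    picked = []

    # Pass 2: keep each eligible line with probability needed / remaining.
    with open(path) as fh:
        for line in fh:
            if needed == 0:
                break
            if skip_token in line:
                continue
            if rng.random() < needed / remaining:
                picked.append(line.rstrip("\n"))
                needed -= 1
            remaining -= 1
    return picked
```

The file is read twice, which costs extra time on a 40 GB input but keeps memory use small. A header row is treated like any other line here, so it would need separate handling if the file has one.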