Reading a 40 GB CSV file into R using bigmemory


Question


    The title is pretty self-explanatory, but I will elaborate as follows. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at this problem. I am trying to figure out the problem using the bigmemory package, but I have been running into difficulties.

    Present Constraints:

    • Using a Linux server with 16 GB of RAM
    • CSV file size: 40 GB
    • Number of rows: 67,194,126,114

    Challenges

    • Need to be able to randomly sample smaller datasets (5-10 million rows) from a big.matrix or equivalent data structure (see the sketch after this list).
    • Need to be able to remove any row with a single instance of NULL while parsing into a big.matrix or equivalent data structure.
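
    For concreteness, here is a minimal sketch of what the first operation could look like, assuming the 40 GB file had already been parsed into a file-backed big.matrix (the step that is proving difficult) and that every column is numeric, since a big.matrix stores a single type. The descriptor file name data.desc is made up for illustration.

    library(bigmemory)

    # Hypothetical descriptor written earlier by read.big.matrix(...,
    #   backingfile = "data.bin", descriptorfile = "data.desc")
    x <- attach.big.matrix("data.desc")

    # Draw 5 million distinct row indices and materialise only those rows;
    # the subset comes back as an ordinary in-memory matrix.
    idx <- sort(sample(nrow(x), 5e6))
    sub <- x[idx, ]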

    So far, results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So I thought I would ask here to see if anyone has used it for something like this.

    Any tips or advice on this line of attack, etc.? Or should I switch to something else? I apologize if this question is very similar to the previous one, but I think the scale of the data is about 20 times bigger than in the previous questions. Thanks!

    Solution

    I don't know about bigmemory, but to satisfy your challenges you don't need to read the file in. Simply pipe some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out NULL lines and randomly select N lines, and then read that in.

    Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

    # Single-pass selection sampling in awk: each non-NULL line is kept with
    # probability m / (lines remaining), yielding roughly m = 100 of the n = 1000000 lines.
    read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; n = 1000000;}
                           !/NULL/{if (rand() < m/(n - NR + 1)) {
                                     print; m--;
                                     if (m == 0) exit;
                                  }}\' filename'
            )) -> df
    

    It wasn't obvious to me what you meant by NULL, so I interpreted it literally, but it should be easy to modify this to fit your needs.
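
    If the subsample then needs to live in a big.matrix rather than a plain data.frame, one way to combine the two ideas is to have awk write the filtered sample to a temporary CSV and load that with bigmemory::read.big.matrix. The sketch below rests on assumptions the thread does not state: the file is called data.csv, it has a header row, every column is numeric (a big.matrix holds a single type), and the keep probability is derived from the row count quoted in the question.

    library(bigmemory)

    n_sample  <- 5e6            # target sample size
    n_total   <- 67194126114    # row count quoted in the question
    keep_prob <- n_sample / n_total

    tmp <- tempfile(fileext = ".csv")

    # Keep the header, drop any line containing NULL, and keep each remaining
    # line with probability keep_prob, in one pass and without holding
    # anything near 40 GB in RAM.
    cmd <- paste0("awk -F, 'BEGIN{srand()} NR == 1 {print; next} ",
                  "!/NULL/ {if (rand() < ", format(keep_prob, scientific = FALSE),
                  ") print}' data.csv > ", tmp)
    system(cmd)

    # File-backed big.matrix, so even the sample does not have to fit in RAM.
    x <- read.big.matrix(tmp, sep = ",", header = TRUE, type = "double",
                         backingfile = "sample.bin", descriptorfile = "sample.desc")
    dim(x)

    Unlike the scheme above, which counts down to exactly m rows, this keeps each line independently, so the sample size is only approximately n_sample; at these scales the difference should not matter.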
