Reading a 40 GB CSV file into R using bigmemory


Question


    The title is pretty self-explanatory, but I will elaborate as follows. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at this problem. I am trying to figure out the problem using the bigmemory package, but I have been running into difficulties.

    Present Constraints:

    • Using a Linux server with 16 GB of RAM
    • CSV file size: 40 GB
    • Number of rows: 67,194,126,114

    Challenges

    • Need to be able to randomly sample smaller datasets (5-10 million rows) from a big.matrix or equivalent data structure (see the sketch after this list).
    • Need to be able to remove any row with a single instance of NULL while parsing into a big.matrix or equivalent data structure.
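
    For concreteness, here is a minimal sketch of what the first operation could look like, assuming the 40 GB file had already been parsed into a file-backed big.matrix (the step that is proving difficult) and that every column is numeric, since a big.matrix stores a single type. The descriptor file name data.desc is made up for illustration.

    library(bigmemory)

    # Hypothetical descriptor written earlier by read.big.matrix(...,
    #   backingfile = "data.bin", descriptorfile = "data.desc")
    x <- attach.big.matrix("data.desc")

    # Draw 5 million distinct row indices and materialise only those rows;
    # the subset comes back as an ordinary in-memory matrix.
    idx <- sort(sample(nrow(x), 5e6))
    sub <- x[idx, ]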

    So far, results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So I thought I would ask here to see if anyone has used it for something like this.

    Any tips or advice on this line of attack, etc.? Or should I switch to something else? I apologize if this question is very similar to the previous one, but I think the scale of the data is about 20 times bigger than in the previous questions. Thanks!

    Solution

    I don't know about bigmemory, but to satisfy your challenges you don't need to read the file in. Simply pipe some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out NULL lines and randomly select N lines, and then read that in.

    Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

    # Single-pass selection sampling in awk: each non-NULL line is kept with
    # probability m / (lines remaining), yielding roughly m = 100 of the n = 1000000 lines.
    read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; n = 1000000;}
                           !/NULL/{if (rand() < m/(n - NR + 1)) {
                                     print; m--;
                                     if (m == 0) exit;
                                  }}\' filename'
            )) -> df
    

    It wasn't obvious to me what you meant by NULL, so I interpreted it literally, but it should be easy to modify this to fit your needs.
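
    If the subsample then needs to live in a big.matrix rather than a plain data.frame, one way to combine the two ideas is to have awk write the filtered sample to a temporary CSV and load that with bigmemory::read.big.matrix. The sketch below rests on assumptions the thread does not state: the file is called data.csv, it has a header row, every column is numeric (a big.matrix holds a single type), and the keep probability is derived from the row count quoted in the question.

    library(bigmemory)

    n_sample  <- 5e6            # target sample size
    n_total   <- 67194126114    # row count quoted in the question
    keep_prob <- n_sample / n_total

    tmp <- tempfile(fileext = ".csv")

    # Keep the header, drop any line containing NULL, and keep each remaining
    # line with probability keep_prob, in one pass and without holding
    # anything near 40 GB in RAM.
    cmd <- paste0("awk -F, 'BEGIN{srand()} NR == 1 {print; next} ",
                  "!/NULL/ {if (rand() < ", format(keep_prob, scientific = FALSE),
                  ") print}' data.csv > ", tmp)
    system(cmd)

    # File-backed big.matrix, so even the sample does not have to fit in RAM.
    x <- read.big.matrix(tmp, sep = ",", header = TRUE, type = "double",
                         backingfile = "sample.bin", descriptorfile = "sample.desc")
    dim(x)

    Unlike the scheme above, which counts down to exactly m rows, this keeps each line independently, so the sample size is only approximately n_sample; at these scales the difference should not matter.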
