Trimming a huge (3.5 GB) csv file to read into R

Problem description

So I've got a data file (semicolon-separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) covering 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.

So my question is this: given that I want all the counties but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?

Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?

Any thoughts would be appreciated.
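
For reference, the pattern the question is asking about does exist in base R: open a connection once, then call readLines on it repeatedly, and the connection keeps track of how far into the file you are. A minimal sketch, with a placeholder file name and chunk size:

    con <- file("huge_file.csv", "rt")     # open once; the connection tracks the read position
    repeat {
      chunk <- readLines(con, n = 100000)  # pull the next 100,000 lines
      if (length(chunk) == 0) break        # zero lines back means end of file
      ## ... filter 'chunk' here and write the survivors to an output file ...
    }
    close(con)

The answers below build this pattern out into complete filtering scripts.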

Update:

• Constraints
  • Needs to use my machine, so no EC2 instances
  • As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
  • As you can see below, the data contains mixed types, which I need to operate on later
• The data is 3.5 GB, with around 8.5 million rows and 17 columns
• A couple thousand rows (~2k) are malformed, with only one column instead of 17
  • These rows are entirely unimportant and can be dropped

Here's a sample of the data:

        County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
        Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
        Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
        NC  [Malformed row]
        [8.5 Mill rows]
        

I want to chop out some columns and pick two out of 40 available years (2009-2010, from 1980-2020), so that the data can fit into R:

        County; State; Year; Quarter; Segment; GDP; ...
        Ada County;NC;2009;4;FIRE;80.1; ...
        Ada County;NC;2010;1;FIRE;82.5; ...
        [~200,000 rows]
        

Results:

After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he supplied a sample implementation.

I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only the columns I want.

It should also be noted that this is MUCH less efficient than Python... as in, Python chomps through the 3.5 GB file in 5 minutes while R takes about 60... but if all you have is R, then this is the ticket.

        ## Open a connection separately to hold the cursor position
        file.in <- file('bad_data.txt', 'rt')
        file.out <- file('chopped_data.txt', 'wt')
        line <- readLines(file.in, n=1)
        line.split <- strsplit(line, ';')
        # Stitching together only the columns we want
        cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
        ## Use a loop to read in the rest of the lines
        line <- readLines(file.in, n=1)
        while (length(line)) {
          line.split <- strsplit(line, ';')
          if (length(line.split[[1]]) > 1) {
            if (line.split[[1]][3] == '2009') {
                cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
            }
          }
          line <- readLines(file.in, n=1)
        }
        close(file.in)
        close(file.out)
        
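
A note on that 60-minute figure: most of it is the cost of handling one line per loop iteration. Since strsplit is vectorized, the same split-and-select logic can be applied to whole blocks of lines at a time. A hedged sketch of that variant (placeholder file names and block size; still keeping fields 1-5 and 8 and filtering on '2009' in field 3, as above):

    con.in <- file('bad_data.txt', 'rt')
    con.out <- file('chopped_data.txt', 'wt')
    ## Copy the header, keeping only the wanted columns
    hdr <- strsplit(readLines(con.in, n = 1), ';')[[1]]
    writeLines(paste(c(hdr[1:5], hdr[8]), collapse = ';'), con.out)
    repeat {
      block <- readLines(con.in, n = 100000)   # one block of lines per pass
      if (!length(block)) break
      parts <- strsplit(block, ';')            # one character vector per input line
      keep <- vapply(parts, function(p) length(p) > 1 && p[3] == '2009', logical(1))
      out <- vapply(parts[keep], function(p) paste(c(p[1:5], p[8]), collapse = ';'),
                    character(1))
      if (length(out)) writeLines(out, con.out)
    }
    close(con.in)
    close(con.out)

This keeps the same streaming structure but amortizes the per-line interpreter overhead over each block.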

Failed approaches:

• sqldf
  • This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes. (A hypothetical sketch of this route follows the list.)
• MapReduce
  • To be honest, the docs intimidated me a bit on this one, so I never got around to trying it. It also looked like it required the object to be in memory, which would defeat the point if that were the case.
• bigmemory
  • This approach linked to the data cleanly, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future, though, I'd consider using only numbers just to keep this option alive.
• scan
  • scan seemed to have the same type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
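
For completeness, a hypothetical sketch of that sqldf route, assuming a well-formed file: read.csv.sql stages the file in a temporary SQLite database and applies the filter there, so only the matching rows ever reach R.

    library(sqldf)
    ## Hypothetical: this only works if every row has all 17 fields;
    ## the ~2k malformed rows in this data set are what made SQLite choke.
    gdp <- read.csv.sql('bad_data.txt',
                        sql = "select * from file where Year in ('2009', '2010')",
                        header = TRUE, sep = ';')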

Accepted answer

My try with readLines. This piece of code creates a csv with the selected years.

          file_in <- file("in.csv","r")
          file_out <- file("out.csv","a")
          x <- readLines(file_in, n=1)
          writeLines(x, file_out) # copy headers
          
          B <- 300000 # chunk size: how many lines to read per pass
          while (length(x)) {
              ind <- grep("^[^;]*;[^;]*;20(09|10)", x) # Year is the 3rd ;-separated field
              if (length(ind)) writeLines(x[ind], file_out)
              x <- readLines(file_in, n=B)
          }
          }
          close(file_in)
          close(file_out)
          
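
Once out.csv has been written, the filtered subset is small enough for an ordinary read, along the lines of:

    selected <- read.csv("out.csv", sep = ";", header = TRUE)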

