Trimming a huge (3.5 GB) csv file to read into R


Problem Description


    So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.

    So my question is this: given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?

    Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?

    Any ideas would be greatly appreciated.

    Update:

    • Constraints
      • Needs to use my machine, so no EC2 instances
      • As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
      • As you can see below, the data contains mixed types, which I need to operate on later
    • Data
      • The data is 3.5GB, with about 8.5 million rows and 17 columns
      • A couple thousand rows (~2k) are malformed, with only one column instead of 17
        • These are entirely unimportant and can be dropped
      • I only need ~100,000 rows out of this file (see the quick size check and data example below)
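
    A rough back-of-the-envelope check of why the target subset fits comfortably in memory; the only inputs are the sizes quoted above, and the numbers are approximate:

    ## ~3.5 GB of raw text over ~8.5 million rows is roughly 400 bytes per row,
    ## so even ~200,000 kept rows amount to well under 100 MB before parsing.
    bytes_per_row <- 3.5e9 / 8.5e6      # ~412 bytes per row
    bytes_per_row * 2e5 / 1e6           # ~82 MB of raw text for the kept rows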

    Data example:

    County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
    Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
    Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
    NC  [Malformed row]
    [8.5 Mill rows]
    

    I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:

    County; State; Year; Quarter; Segment; GDP; ...
    Ada County;NC;2009;4;FIRE;80.1; ...
    Ada County;NC;2010;1;FIRE;82.5; ...
    [~200,000 rows]
    

    Results:

    After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.

    I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only the columns I want.

    It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.

    ## Open a connection separately to hold the cursor position
    file.in <- file('bad_data.txt', 'rt')
    file.out <- file('chopped_data.txt', 'wt')
    line <- readLines(file.in, n=1)
    line.split <- strsplit(line, ';')
    # Stitching together only the columns we want
    cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    ## Use a loop to read in the rest of the lines
    line <- readLines(file.in, n=1)
    while (length(line)) {
      line.split <- strsplit(line, ';')
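      # skip the ~2k malformed rows that contain only a single column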
      if (length(line.split[[1]]) > 1) {
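        # field 3 is the Year column; this adapted pass keeps the 2009 rows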
        if (line.split[[1]][3] == '2009') {
            cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
        }
      }
      line <- readLines(file.in, n=1)
    }
    close(file.in)
    close(file.out)
    

    Failings by Approach:

    • sqldf
      • This is definitely what I'll use for this type of problem in the future if the data is well-formed (see the sketch after this list). However, if it's not, then SQLite chokes.
    • MapReduce
      • To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
    • bigmemory
      • This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.
    • scan
      • Scan seemed to have similar type issues to bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
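
    For reference, here is a minimal sketch of what the sqldf route could look like once the file is well-formed. It is illustrative only: the file name clean_data.csv and the exact query are assumptions based on the column names above, not code that was actually run. read.csv.sql stages the file in a temporary SQLite database, so only the rows returned by the query ever reach R's memory.

    ## Hypothetical sqldf sketch: filter rows and columns before they reach R.
    ## "clean_data.csv" stands in for a well-formed version of the original file.
    library(sqldf)

    trimmed <- read.csv.sql(
        "clean_data.csv",
        sql = "select County, State, Year, Quarter, Segment, GDP
               from file
               where Year in (2009, 2010)",
        header = TRUE,
        sep = ";"
    )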

    Solution

    My try with readLines. This piece of code creates a csv with only the selected years.

    file_in <- file("in.csv","r")
    file_out <- file("out.csv","a")
    x <- readLines(file_in, n=1)
    writeLines(x, file_out) # copy headers
    
    B <- 300000 # chunk size: how many lines to read per pass
    while(length(x)) {
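        # lines whose third ';'-separated field (Year) matches 2009 or 2010 (note the leading space in the pattern)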
        ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)
        if (length(ind)) writeLines(x[ind], file_out)
        x <- readLines(file_in, n=B)
    }
    close(file_in)
    close(file_out)
    
