How to filter a very large csv in R prior to opening it?


Question

I'm currently trying to open a 48 GB CSV on my computer. Needless to say, my RAM cannot hold such a huge file, so I'm trying to filter it before opening it. From what I've researched, the most appropriate way to do so in R is with the sqldf library, more specifically its read.csv.sql function:

df <- read.csv.sql('CIF_FOB_ITIC-en.csv', sql = "SELECT * FROM file WHERE 'Year' IN (2014, 2015, 2016, 2017, 2018)")

However, I got the following message:

Error: duplicate column name: Measure

As SQL is case-insensitive, having two variables, one named Measure and another named MEASURE, results in duplicate column names. To get around this, I tried passing header = FALSE and substituting V9 for 'Year', which yielded the following error instead:

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: CIF_FOB_ITIC-en.csv line 2 expected 19 columns of data but found 24
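(For reference, the attempted call is not shown above; it would have looked roughly like the sketch below. The exact arguments are assumptions, with V9 being the positional name of the Year column when header = FALSE, as mentioned in the question.)

df <- read.csv.sql(
  'CIF_FOB_ITIC-en.csv',
  header = FALSE,
  sql    = "SELECT * FROM file WHERE V9 IN (2014, 2015, 2016, 2017, 2018)"
)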

How should I proceed in this case?

Thanks in advance!

Answer

Here's a Tidyverse solution that reads in chunks of the CSV, filters them, and stacks up the resulting rows. It also does this in parallel, so the whole file still gets scanned, but far more quickly (depending on your core count) than if the chunks were processed one at a time, as with apply (or purrr::map, for that matter).

Comments are inline.

library(tidyverse)
library(furrr)

# Make a CSV file out of the NASA stock dataset for demo purposes
raw_data_path <- tempfile(fileext = ".csv")
nasa %>% as_tibble() %>% write_csv(raw_data_path)

# Get the row count of the raw data, incl. header row, without loading the
# actual data
raw_data_nrow <- length(count.fields(raw_data_path))

# Hard-code the largest batch size you can, given your RAM in relation to the
# data size per row
batch_size    <- 1e3 

# Set up parallel processing of multiple chunks at a time, leaving one virtual
# core, as usual (multisession replaces the now-deprecated multiprocess plan)
plan(multisession, workers = availableCores() - 1)

filtered_data <- 
  # Define the sequence of start-point row numbers for each chunk (each number
  # is actually the start point minus 1 since we're using the seq. no. as the
  # no. of rows to skip)
  seq(from = 0, 
      # Add the batch size to ensure that the last chunk is large enough to grab
      # all the remainder rows
      to = raw_data_nrow + batch_size, 
      by = batch_size) %>% 
  future_map_dfr(
    ~ read_csv(
      raw_data_path,
      skip      = .x,
      n_max     = batch_size, 
      # Can't read in col. names in each chunk since they're only present in the
      # 1st chunk
      col_names = FALSE,
      # This reads in each column as character, which is safest but slowest and
      # most memory-intensive. If you're sure that each batch will contain
      # enough values in each column so that the type detection in each batch
      # will come to the same conclusions, then comment this out and leave just
      # the guess_max
      col_types = cols(.default = "c"),
      guess_max = batch_size
    ) %>% 
      # This is where you'd insert your filter condition(s)
      filter(TRUE),
    # Progress bar! So you know how many chunks you have left to go
    .progress = TRUE
  ) %>% 
  # The first row will be the header values, so set the column names to equal
  # that first row, and then drop it
  set_names(slice(., 1)) %>% 
  slice(-1)
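To apply the Year filter from the question, replace the filter(TRUE) placeholder with the real condition. Because col_names = FALSE, readr names the columns X1, X2, ..., so, assuming Year is the ninth column (as the question's use of V9 suggests), the condition would look roughly like the sketch below; note that the values are compared as strings because every column was read in as character.

# Hypothetical filter for the question's data; X9 is assumed to be the Year column
filter(X9 %in% c("2014", "2015", "2016", "2017", "2018"))

After filtering, readr::type_convert() can be run on the result to re-guess proper column types from the character columns.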
