Still struggling with handling large data set


Problem description


I have been reading around on this website and haven't been able to find the exact answer. If it already exists, I apologize for the repost.


I am working with extremely large data sets (600 million rows, 64 columns, on a computer with 32 GB of RAM). I really only need much smaller subsets of this data, but I am struggling to perform any operations beyond simply importing one data set with fread and selecting the 5 columns I need. After that, I try to overwrite my data set with the specific conditions I need, but I hit my RAM cap and get the message "Error: cannot allocate vector of size 4.5 Gb". I looked at the ff and bigmemory packages as alternatives, but it seems like you can't subset before importing with those packages? Is there any solution to this problem besides upgrading the RAM on my computer?

The task I want to perform:

SampleTable <- fread("my.csv", header = TRUE, sep = ",",
                     select = c("column1", "column2", "column7", "column12", "column15"))

SampleTable2 <- SampleTable[column1 == "6" & column7 == "1"]


At this point, I hit my memory cap. Would it be better to try to use another package and import all 64 columns of the 600 million rows? I also don't want to spend hours upon hours just to perform one import.

Recommended answer


What you could do is read the CSV file in chunks:

# Define only the subset of columns to keep: columns whose
# colClasses entry is NULL are skipped entirely by read.csv
csv <- "my.csv"
colnames <- names(read.csv(csv, header = TRUE, nrows = 1))
colclasses <- rep(list(NULL), length(colnames))
ind <- c(1, 2, 7, 12, 15)
colclasses[ind] <- "double"

# Read the header and the first data line, rename, then filter
library(dplyr)
l_df <- list()
con <- file(csv, "rt")
df <- read.csv(con, header = TRUE, nrows = 1, colClasses = colclasses)
names(df) <- paste0("V", ind)
l_df[[i <- 1]] <- filter(df, V1 == 6, V7 == 1)

# Read all remaining lines in chunks, keeping only matching rows.
# With header = FALSE, read.csv names the kept columns by their
# position in the file (V1, V2, V7, V12, V15), matching the above.
repeat {
  i <- i + 1
  df <- read.csv(con, header = FALSE, nrows = 9973, colClasses = colclasses)
  l_df[[i]] <- filter(df, V1 == 6, V7 == 1)
  if (nrow(df) < 9973) break
}
close(con)
df <- do.call("rbind", l_df)


9973 is an arbitrary prime number, chosen because it is unlikely to be a divisor of nlines - 1 (the number of data rows); if it were, the final read.csv call would try to read zero lines from the connection and raise an error before the loop breaks.
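Since the question uses data.table, the same chunked-read-and-filter idea can also be sketched with fread itself, using skip/nrows. This is a minimal, self-contained illustration: it writes a small demo file standing in for the asker's "my.csv" (the column names match the question; the data values and chunk size are made up for the demo, and on real data you would use a chunk size of, say, one million rows):

```r
library(data.table)

# Hypothetical demo file standing in for the asker's "my.csv"
csv <- tempfile(fileext = ".csv")
fwrite(data.table(column1  = rep(1:6, 10),
                  column2  = 1:60,
                  column7  = rep(0:1, 30),
                  column12 = rnorm(60),
                  column15 = letters[(1:60) %% 26 + 1]), csv)

sel        <- c("column1", "column2", "column7", "column12", "column15")
chunk_size <- 7L                       # tiny for the demo; use e.g. 1e6 on real data
header     <- names(fread(csv, nrows = 0))  # read only the column names

res  <- list()
skip <- 1L   # skip the header line on the first chunk
i    <- 1L
repeat {
  # Each fread call re-opens the file and skips ahead, so this trades
  # some repeated scanning for a bounded memory footprint
  chunk <- fread(csv, skip = skip, nrows = chunk_size,
                 header = FALSE, col.names = header)[, ..sel]
  res[[i]] <- chunk[column1 == 6 & column7 == 1]
  if (nrow(chunk) < chunk_size) break
  skip <- skip + chunk_size
  i    <- i + 1L
}
result <- rbindlist(res)
```

The same caveat as above applies: if the number of data rows is an exact multiple of chunk_size, the final fread call would start past the end of the file and error, which is why an "awkward" chunk size (such as a prime) is a cheap safeguard.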
