Manipulation of Large Files in R

Problem Description

I have 15 files of data, each around 4.5 GB. Each file is a month's worth of data for around 17,000 customers. All together, the data represents information on 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each denoting a month, I have 17,000 files, one per customer, each containing all of that customer's data. I wrote a script to do this:

library(data.table)
# the variable 'files' is a vector of locations of the 15 month files
exists = NULL  # keeps track of customers who already have a file created for them
for (w in 1:15){  # for each of the 15 month files
  month = fread(files[w], select = c(2,3,6,16))  # read in only the columns I want
  custlist = unique(month$CustomerID)  # all customers present in this month file
  for (i in 1:length(custlist)){  # for each customer in this month file
    curcust = custlist[i]  # the current customer
    newchunk = subset(month, CustomerID == curcust)  # all the data for this customer
    filename = sprintf("cust%s", curcust)  # the filename for this customer
    if (curcust %in% exists){  # if a file has already been created for this customer, read it, append to it, and write it back
      custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv",filename)))  # read in the existing file
      custfile$V1 = NULL  # remove the row-name column that write.csv added
      custfile = rbind(custfile, newchunk)  # combine the existing data with the new data
      write.csv(custfile, file = strwrap(sprintf("C:/custFiles/%s.csv",filename)))
    } else {  # if it has not been created, write newchunk to a new csv
      write.csv(newchunk, file = strwrap(sprintf("C:/custFiles/%s.csv",filename)))
      exists = rbind(exists, curcust, deparse.level = 0)  # add customer to the list of existing files
    }
  }
}

The script works (at least, I'm pretty sure). The problem is that it is incredibly slow. At the rate I'm going, it's going to take a week or more to finish, and I don't have that time. Do any of you know a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.

Recommended Answer

Like @Dominic Comtois, I would also recommend using SQL.
R can handle quite big data - there is a nice benchmark of 2 billion rows which beats python - but because R runs mostly in memory, you need a good machine to make it work. Still, your case doesn't need to load more than one 4.5 GB file at a time, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.
You can utilize R to load the data into a SQL database and later query it from the database. If you don't know SQL, you may want to use some simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency; the RSQLite package contains the database engine embedded.

library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db")
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE)
  cat("data for",monthfile,"loaded to db
")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data, convert the result to data.table by reference
setDT(df)
dbDisconnect(conn)

That's all. You get to use SQL without really having to deal with much of the overhead usually associated with databases.
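
If you then want to produce the per-customer files from the database, a minimal sketch could look like the following. It assumes the mytablename table and customerid column from the block above have already been loaded; the index name idx_cust and the cust%s.csv filename pattern are illustrative choices, not part of the original answer.

library(RSQLite)
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db")
# an index on customerid makes the thousands of per-customer queries fast
dbExecute(conn, "create index if not exists idx_cust on mytablename (customerid)")
custids <- dbGetQuery(conn, "select distinct customerid from mytablename")[[1]]
for (id in custids) {
  # fetch one customer's rows with a parameterized query
  cust <- dbGetQuery(conn, "select * from mytablename where customerid = ?",
                     params = list(id))
  write.csv(cust, file = sprintf("cust%s.csv", id), row.names = FALSE)
}
dbDisconnect(conn)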

If you prefer to go with the approach from your post, I think you can dramatically speed it up by doing the csv write by group while aggregating in data.table.

library(data.table)
monthfiles <- c("month1","month2") # ...
# write data, appending rows from later months to each customer's csv;
# write.table is used because write.csv ignores the append argument
for(monthfile in monthfiles){
  fread(monthfile)[, write.table(.SD, file=paste0(CustomerID,".csv"), sep=",",
                                 row.names=FALSE,
                                 col.names=!file.exists(paste0(CustomerID,".csv")),
                                 append=TRUE), by=CustomerID]
  cat("data for", monthfile, "written to csv\n")
}

So you utilize the fast unique from data.table and perform the subsetting while grouping, which is also ultra fast. Below is a working example of the approach.

library(data.table)
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]

Update 2016-12-05:
As of data.table 1.9.8+, you can replace write.csv with fwrite, as in this answer.
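
For reference, a sketch of the grouped write using fwrite instead, assuming data.table >= 1.9.8 and the same monthfiles vector as above:

library(data.table)
monthfiles <- c("month1","month2") # ...
for (monthfile in monthfiles) {
  # fwrite supports append=TRUE, so rows from later months are added to each customer's file
  fread(monthfile)[, fwrite(.SD, file=paste0(CustomerID,".csv"), append=TRUE), by=CustomerID]
}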
