Manipulation of Large Files in R


Problem Description

I have 15 files of data, each around 4.5GB. Each file is a month's worth of data for around 17,000 customers. Together, the data represents information on 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each denoting a month, I have 17,000 files, one per customer, each containing all of that customer's data. I wrote a script to do this:

#the variable 'files' is a vector of locations of the 15 month files
library(data.table) #needed for fread
exists = NULL  #This vector keeps track of customers who have a file created for them
for (w in 1:15){  #for each of the 15 month files
  month = fread(files[w],select = c(2,3,6,16))  #read in the data I want
  custlist = unique(month$CustomerID) #a list of all customers in this month file
  for (i in 1:length(custlist)){ #for each customer in this month file
    curcust = custlist[i] #the current customer
    newchunk = subset(month,CustomerID == curcust) #all the data for this customer
    filename = sprintf("cust%s",curcust) #the filename for this customer (existing or to be created)
    if (curcust %in% exists){ #check if a file has been created for this customer. If so, open it, append the new data, and write it back
      custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv",filename)))#read in file
      custfile$V1 = NULL #remove the extra row-names column (written by write.csv, read back in as V1)
      custfile= rbind(custfile,newchunk)#combine read in data with our new data
      write.csv(custfile,file = strwrap(sprintf("C:/custFiles/%s.csv",filename)))
    } else { #if it has not been created, write newchunk to a csv
      write.csv(newchunk,file = strwrap(sprintf("C:/custFiles/%s.csv",filename)))
      exists = rbind(exists,curcust,deparse.level = 0) #add customer to list of existing files
    }
  }
 }

The script works (at least, I'm pretty sure it does). The problem is that it is incredibly slow. At the rate I'm going, it will take a week or more to finish, and I don't have that time. Do any of you know of a better, faster way to do this in R? Should I try doing this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.

Solution

Like @Dominic Comtois, I would also recommend using SQL.
R can handle quite big data - there is a nice benchmark of 2 billion rows in which it beats Python - but because R works mostly in memory, you need a good machine to make it work. Still, your case doesn't require loading more than one 4.5GB file at a time, so it should be quite doable on a personal computer; see the second approach below for a fast non-database solution.
You can use R to load the data into a SQL database and later query it from there. If you don't know SQL, you may want to use some simple database. The simplest way from R is RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency: the RSQLite package contains the database engine embedded.

library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db")
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE)
  cat("data for",monthfile,"loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data, I would recommend converting the result to data.table as below
setDT(df)
dbDisconnect(conn)

That's all. You get to use SQL without much of the overhead usually associated with databases.
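
Not part of the original answer, but a possible follow-up: since every lookup filters on customerid, adding an index after the load should make the per-customer queries much faster, and you can then export one CSV per customer straight from the database. A minimal sketch, assuming the conn and mytablename from the block above (run it before dbDisconnect) and numeric customer IDs; the index name idx_cust is just a placeholder:

# assumes conn and mytablename from the previous block; run before dbDisconnect(conn)
# index customerid so per-customer queries don't scan the whole table
dbExecute(conn, "create index if not exists idx_cust on mytablename (customerid)")
# export one CSV per customer straight from the database
custids <- dbGetQuery(conn, "select distinct customerid from mytablename")[[1]]
for(custid in custids){
  cust <- dbGetQuery(conn, sprintf("select * from mytablename where customerid = %s", custid))
  write.csv(cust, file = sprintf("C:/custFiles/cust%s.csv", custid), row.names = FALSE)
}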

If you prefer to go with the approach from your post, I think you can dramatically speed it up by doing the write.csv by groups while aggregating in data.table.

library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  # note: write.csv ignores append=TRUE (with a warning), so each month overwrites the last; the fwrite update below appends correctly
  fread(monthfile)[, write.csv(.SD,file=paste0(CustomerID,".csv"), append=TRUE), by=CustomerID]
  cat("data for",monthfile,"written to csv\n")
}

So you utilize the fast unique from data.table and perform subsetting while grouping, which is also ultra fast. Below is a working example of the approach.

library(data.table)
# recycling gives b = c(5,6,5,6), so this writes one file per group: 5.csv and 6.csv
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]


Update 2016-12-05:
Starting from data.table 1.9.8+, you can replace write.csv with fwrite; see the example in this answer.
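
For reference, a minimal sketch of what that replacement might look like (my reconstruction; the example from the linked answer is not reproduced here). Unlike write.csv, fwrite honors append = TRUE, so each customer's file accumulates data correctly across the month files:

library(data.table)
monthfiles <- c("month1","month2") # ...
for(monthfile in monthfiles){
  # fwrite (data.table 1.9.8+) appends correctly and is much faster than write.csv
  fread(monthfile)[, fwrite(.SD, file = paste0(CustomerID, ".csv"), append = TRUE), by = CustomerID]
}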
