Functions for creating and reshaping big data in R using the FF package
Question
I'm new to R and the FF package, and am trying to better understand how FF allows users to work with large datasets (>4 GB). I have spent a considerable amount of time trawling the web for tutorials, but the ones I could find generally go over my head.
I learn best by doing, so as an exercise I would like to know how to create a long-format time-series dataset, similar to R's built-in "Indometh" dataset, using arbitrary values. Then I would like to reshape it into wide format and save the output as a csv file.
With small datasets this is simple, and can be achieved using the following script:
##########################################
#Generate the data frame
DF<-data.frame()
for(Subject in 1:6){
  for(time in 1:11){
    DF<-rbind(DF,c(Subject,time,runif(1)))
  }
}
names(DF)<-c("Subject","time","conc")
##########################################
#Reshape to wide format
DF<-reshape(DF, v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
##########################################
#Save csv file
write.csv(DF,file="DF.csv")
But I would like to learn how to do this for file sizes of approximately 10 GB. How would I do this using the FF package? Thanks in advance.
Recommended answer
The function reshape does not explicitly exist for ffdf objects, but the same result is quite straightforward to get with functionality from package ffbase. Just use ffdfdply from package ffbase, split by Subject, and apply reshape inside the function.
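To make the idea concrete, here is a minimal base-R sketch of the same split, apply, combine pattern on a tiny in-memory data frame. It does not use ff or ffbase at all. The names reshape_chunk and chunk_size are made up for this illustration, and the fixed chunk size is an assumption: as far as I understand it, ffdfdply chooses its own batches based on available RAM, while keeping all rows of one split element (here, one Subject) together.

```r
# Base-R illustration of chunked split-apply-combine (no ff/ffbase involved).
# reshape_chunk and chunk_size are hypothetical names used only in this sketch.
set.seed(1)
long <- expand.grid(time = c(0.25, 0.5, 1, 2), Subject = 1:6)
long$conc <- runif(nrow(long))

# The function applied to each chunk: the same reshape() call as in the question.
reshape_chunk <- function(chunk) {
  reshape(chunk, v.names = "conc", idvar = "Subject",
          timevar = "time", direction = "wide")
}

# Group subjects into chunks so that all rows of a subject stay together.
chunk_size <- 2  # subjects per chunk (assumed value)
subjects <- unique(long$Subject)
groups <- split(subjects, ceiling(seq_along(subjects) / chunk_size))

# Reshape each chunk separately, then stack the wide pieces.
pieces <- lapply(groups, function(ids) {
  reshape_chunk(long[long$Subject %in% ids, ])
})
wide <- do.call(rbind, pieces)
dim(wide)
```

The point that matters is that a chunk never cuts a subject in half. If it did, reshape would produce two partial rows for that subject instead of one complete row.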
An example modelled on the Indometh dataset, using its time points with 1,000,000 subjects:
require(ffbase)
require(datasets)
data(Indometh)
## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
dim(x)
## [1] 11000000        3
## and reshape to wide format
result <- ffdfdply(x=x, split=x$Subject, FUN=function(datawithseveralsplitelements){
  df <- reshape(datawithseveralsplitelements,
                v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
## [1] "ffdf"
colnames(result)
## [1] "Subject" "conc.0.25" "conc.0.5" "conc.0.75" "conc.1" "conc.1.25" "conc.2" "conc.3" "conc.4" "conc.5" "conc.6" "conc.8"
dim(result)
## [1] 1000000      12
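The question also asks about saving the output as a csv file. For an ffdf, the ff package provides write.csv.ffdf, which as far as I know can be called along the lines of write.csv.ffdf(result, file = "result.csv"); check the ff documentation before relying on that exact call. The sketch below does not use ff. It only shows the underlying idea in base R: write the header once, then append the rows chunk by chunk so the whole table never has to be formatted as text in one go. The names write_csv_chunked and chunk_rows are hypothetical, and the function assumes the data frame has at least one row.

```r
# Write a data frame to CSV in row chunks: header with the first chunk, then append.
# write_csv_chunked and chunk_rows are hypothetical names for this sketch.
# Assumes df has at least one row.
write_csv_chunked <- function(df, file, chunk_rows = 1000) {
  starts <- seq(1, nrow(df), by = chunk_rows)
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + chunk_rows - 1, nrow(df))
    write.table(df[rows, , drop = FALSE], file = file, sep = ",",
                row.names = FALSE,
                col.names = (i == 1),  # header only for the first chunk
                append = (i > 1))      # later chunks are appended
  }
  invisible(file)
}

# Small stand-in for the wide result; a real ffdf would be read in chunks instead.
wide <- data.frame(Subject = 1:5, conc.0.25 = runif(5), conc.0.5 = runif(5))
out <- tempfile(fileext = ".csv")
write_csv_chunked(wide, out, chunk_rows = 2)
```

Reading the file back with read.csv(out) should give the same five rows and three columns.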