Functions for creating and reshaping big data in R using the FF package

Problem description

I'm new to R and the FF package, and am trying to better understand how FF allows users to work with large datasets (>4Gb). I have spent a considerable amount of time trawling the web for tutorials, but the ones I could find generally go over my head.

I learn best by doing, so as an exercise, I would like to know how to create a long-format time-series dataset, similar to R's in-built "Indometh" dataset, using arbitrary values. Then I would like to reshape it into wide format. Then I would like to save the output as a csv file.

With small datasets this is simple, and can be achieved using the following script:

##########################################
#Generate the data frame

DF<-data.frame()
for(Subject in 1:6){
  for(time in 1:11){
    DF<-rbind(DF,c(Subject,time,runif(1)))
  }
}
names(DF)<-c("Subject","time","conc")

##########################################
#Reshape to wide format

DF<-reshape(DF, v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")

##########################################
#Save csv file

write.csv(DF,file="DF.csv")

But I would like to learn to do this for file sizes of approximately 10 Gb. How would I do this using the FF package? Thanks in advance.

Recommended answer

There is no reshape function for ffdf objects as such, but the same result is quite straightforward to get with functionality from the ffbase package. Just use ffdfdply from ffbase, split by Subject, and apply reshape inside the function. Because every chunk passed to the function contains all rows for the Subjects in it, each chunk can be reshaped independently.
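To see why splitting by Subject is safe, here is a minimal in-memory sketch in base R only (no ff involved). It is an illustration of the split, reshape, and combine idea, not of how ffdfdply is implemented; the small data frame `long` and the grouping vector `chunk_id` are made up for this example.

```r
# Small long-format data set: 4 subjects x 3 time points (arbitrary values).
set.seed(1)
long <- expand.grid(time = c(0.25, 0.5, 1), Subject = factor(1:4))
long$conc <- runif(nrow(long))

# Reference result: reshape the whole data set in one go.
wide_all <- reshape(long, v.names = "conc", idvar = "Subject",
                    timevar = "time", direction = "wide")

# Chunked version. chunk_id is a hypothetical grouping that keeps all rows
# of a Subject in the same chunk (Subjects 1-2 in chunk 1, 3-4 in chunk 2).
chunk_id <- ifelse(as.integer(long$Subject) <= 2, 1, 2)
chunks <- split(long, chunk_id)

# Reshape each chunk on its own, then stack the wide pieces.
wide_chunks <- lapply(chunks, function(chunk) {
  reshape(chunk, v.names = "conc", idvar = "Subject",
          timevar = "time", direction = "wide")
})
wide_combined <- do.call(rbind, wide_chunks)

# Row names differ between the two routes, so drop them before comparing.
rownames(wide_all) <- NULL
rownames(wide_combined) <- NULL
print(wide_combined)
```

Both routes give one row per Subject with the columns Subject, conc.0.25, conc.0.5 and conc.1. The only requirement is that a Subject's rows are never spread over two chunks, which is what splitting on Subject guarantees.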

An example using the time points of the Indometh dataset and 1,000,000 subjects:

library(ffbase)    # ffbase depends on ff, so ff is attached as well
library(datasets)
data(Indometh)

## Generate some random data: 1,000,000 subjects x 11 time points
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n = nrow(x), rfun = rnorm)
dim(x)
## [1] 11000000        3

## Reshape to wide format. FUN receives an ordinary in-memory data.frame
## that holds all rows for several Subjects at a time.
result <- ffdfdply(x = x, split = x$Subject, FUN = function(datawithseveralsplitelements) {
  df <- reshape(datawithseveralsplitelements,
                v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
## [1] "ffdf"
colnames(result)
## [1] "Subject"   "conc.0.25" "conc.0.5"  "conc.0.75" "conc.1"    "conc.1.25" "conc.2"    "conc.3"    "conc.4"    "conc.5"    "conc.6"    "conc.8"
dim(result)
## [1] 1000000      12
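
The question also asks how to save the output as a csv file, which the answer does not cover. The following is a minimal sketch, assuming the installed version of ff provides write.csv.ffdf (the ffdf counterpart of write.csv). The small table `wide` is a made-up stand-in for `result`, and the file name `DF_wide.csv` is arbitrary.

```r
library(ff)  # assumed to provide as.ffdf() and write.csv.ffdf()

# Hypothetical small wide-format table standing in for `result` above.
wide <- data.frame(
  Subject   = factor(1:3),
  conc.0.25 = c(0.11, 0.42, 0.73),
  conc.0.5  = c(0.25, 0.56, 0.87)
)
wide_ff <- as.ffdf(wide)  # convert the data.frame to an on-disk ffdf

# Write the ffdf to a csv file in the session's temporary directory.
out_file <- file.path(tempdir(), "DF_wide.csv")
write.csv.ffdf(wide_ff, file = out_file)

# Read the file back with base R to check what was written.
check <- read.csv(out_file)
print(check)
```

With the real data the call would be write.csv.ffdf(result, file = "DF.csv"), so the wide table should not need to be pulled into RAM as a single data.frame first.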
