Aggregation using the ffdfdply function in R
Question
I am trying to aggregate a large dataset using the ffdfdply function from the 'ffbase' package in R.
Say I have three variables called Date, Item and sales. I want to aggregate sales over Date and Item using the sum function. Could you please guide me to the proper syntax in R?
Here is what I tried:
grp_qty <- ffdfdply(x = data[c("sales", "Date", "Item")],
                    split = as.character(data$sales),
                    FUN = function(data)
                      summaryBy(Date + Item ~ sales, data = data, FUN = sum))
I would appreciate your help with a solution.
Recommended answer
Note that ffdfdply is part of ffbase, not ff.
To show an example of how ffdfdply is used, let's generate an ffdf with roughly 50 million rows.
require(ffbase)
data <- expand.ffgrid(Date = ff(seq.Date(Sys.Date(), Sys.Date() + 10000, by = "day")),
                      Item = ff(factor(paste("Item", 1:5000))))
data$sales <- ffrandom(n = nrow(data))
# split by date -> assuming that all sales of 1 date can fit into RAM
splitby <- as.character(data$Date, by = 250000)
grp_qty <- ffdfdply(x = data[c("sales", "Date", "Item")],
                    split = splitby,
                    FUN = function(data) {
                      ## This happens in RAM - containing **several** split elements,
                      ## so here we can use data.table, which works fine for in-RAM computing
                      require(data.table)
                      data <- as.data.table(data)
                      result <- data[, list(sales = sum(sales, na.rm = TRUE)), by = list(Date, Item)]
                      as.data.frame(result)
                    })
dim(grp_qty)
Note that grp_qty is an ffdf, which resides on disk.
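To see why splitting on Date gives correct totals, here is a minimal in-memory sketch in base R. It is not part of the original answer: the `toy` data frame and the `sum_sales` helper are hypothetical names made up for illustration, and `split`/`lapply` only stand in for the chunking that ffdfdply does on disk. The idea it illustrates is that when the split key is one of the grouping variables, every Date/Item group lands in exactly one chunk, so per-chunk sums match the sums over the whole dataset.

```r
# Toy in-memory stand-in for the ffdf (hypothetical data, not from the original post)
toy <- expand.grid(
  Date = as.Date("2024-01-01") + 0:2,
  Item = factor(paste("Item", 1:2))
)
toy <- rbind(toy, toy)            # two sales records per Date/Item pair
toy$sales <- seq_len(nrow(toy))   # deterministic values 1..12

# Per-chunk worker: sum sales within each Date/Item group.
# Note the formula direction: the value to aggregate goes on the left.
sum_sales <- function(chunk) {
  aggregate(sales ~ Date + Item, data = chunk, FUN = sum)
}

# Split on Date so every Date/Item group falls into exactly one chunk
chunks  <- split(toy, as.character(toy$Date))
chunked <- do.call(rbind, lapply(chunks, sum_sales))

# Reference: aggregate everything at once
whole <- sum_sales(toy)

# Put both results in the same row order before comparing
chunked <- chunked[order(chunked$Date, chunked$Item), ]
whole   <- whole[order(whole$Date, whole$Item), ]
stopifnot(isTRUE(all.equal(chunked$sales, whole$sales)))
```

Splitting on sales, as in the question, would not have this property: rows of the same Date/Item group could end up in different chunks, and each chunk would then report only a partial sum for that group.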