如何在R中优化sapply以计算数据帧上的运行总计 [英] How do I optimize sapply in R to calculate running totals on a dataframe
问题描述
函数和调用函数的方式:
accumulate< - function(recordnum,df){
sumthese < - (df $ subject == df $ subject [recordnum])&
(df $ month< = df $ month [recordnum])
sum(df $ measurement [sumthese])
}
set.seed(42)
datalength = 10
df< - data.frame(measure = runif(1:datalength),
subject = rep(c(dog,cat),each = datalength / 2)
month = rep(seq(datalength / 2,1,by = -1)))
system.time(df $ cumulative< - sapply(1:datalength,accumulate,df))
输入数据框:
code>> df
测量主题月份
1 0.4577418狗5
2 0.7191123狗4
3 0.9346722狗3
4 0.2554288狗2
5 0.4622928狗1
6 0.9400145 cat 5
7 0.9782264 cat 4
8 0.1174874 cat 3
9 0.4749971 cat 2
10 0.5603327 cat 1
/ pre>
输出数据框:
> df
测量主体月累积
1 0.9148060狗5 3.6102141
2 0.9370754狗4 2.6954081
3 0.2861395狗3 1.7583327
4 0.8304476狗2 1.4721931
5 0.6417455狗1 0.6417455
6 0.5190959猫5 2.7524079
7 0.7365883猫4 2.2333120
8 0.1346666猫3 1.4967237
9 0.6569923猫2 1.3620571
10 0.7050648猫1 0.7050648
注意,累积列显示了直到并包括当前月份的所有度量的积累。该功能不需要对数据帧进行排序。数据长度等于100时,经过时间为0.3。 1000是0.58。 10,000 = 27.72。我需要这个运行200K +的记录。
谢谢!解决方案这是非破坏性的,即原始
df
未被修改。没有使用包装。df
行的原始顺序保留;然而,如果这不重要,则可以省略最后一行的[order(o),]
。o< - order(df $ subject,df $ month)
transform(df [o,],cumulative = ave(measurement,subject,FUN = cumsum))[订单(o),]
给:
测量主体月累积
1 0.37955924狗5 2.2580530
2 0.43577158狗4 1.8784938
3 0.03743103狗3 1.4427222
4 0.97353991狗2 1.4052912
5 0.43175125狗1 0.4317512
6 0.95757660猫5 4.0751151
7 0.88775491猫4 3.1175385
8 0.63997877猫3 2.2297836
9 0.97096661猫2 1.5898048
10 0.61883821 cat 1 0.6188382
I wrote a function in R to calculate cumulative totals by month number, but the execution time of my method grows exponentially as the dataset gets larger. I'm a novice R programmer, can you help me make this more efficient?
The function and the way I invoke the function:accumulate <- function(recordnum,df){ sumthese <- (df$subject == df$subject[recordnum]) & (df$month <= df$month[recordnum]) sum(df$measurement[sumthese]) } set.seed(42) datalength = 10 df <- data.frame(measurement = runif(1:datalength), subject=rep(c("dog","cat"),each =datalength/2), month=rep(seq(datalength/2,1,by=-1))) system.time(df$cumulative <- sapply(1:datalength,accumulate,df))
The input dataframe:
> df measurement subject month 1 0.4577418 dog 5 2 0.7191123 dog 4 3 0.9346722 dog 3 4 0.2554288 dog 2 5 0.4622928 dog 1 6 0.9400145 cat 5 7 0.9782264 cat 4 8 0.1174874 cat 3 9 0.4749971 cat 2 10 0.5603327 cat 1
The output dataframe:
> df measurement subject month cumulative 1 0.9148060 dog 5 3.6102141 2 0.9370754 dog 4 2.6954081 3 0.2861395 dog 3 1.7583327 4 0.8304476 dog 2 1.4721931 5 0.6417455 dog 1 0.6417455 6 0.5190959 cat 5 2.7524079 7 0.7365883 cat 4 2.2333120 8 0.1346666 cat 3 1.4967237 9 0.6569923 cat 2 1.3620571 10 0.7050648 cat 1 0.7050648
Notice the cumulative column shows the accumulation of all measurements up to and including the current month. The function does not require the dataframe to be sorted. When the datalength equals 100, the elapsed time is 0.3. 1000 is 0.58. 10,000 = 27.72. I need this to run for 200K+ records.
Thanks!解决方案This is non-destructive, i.e. the original
df
is not modified. No packages are used. The original order of the rows ofdf
is preserved; however, if that is not important then[order(o), ]
on the last line can be omitted.o <- order(df$subject, df$month) transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]
giving:
measurement subject month cumulative 1 0.37955924 dog 5 2.2580530 2 0.43577158 dog 4 1.8784938 3 0.03743103 dog 3 1.4427222 4 0.97353991 dog 2 1.4052912 5 0.43175125 dog 1 0.4317512 6 0.95757660 cat 5 4.0751151 7 0.88775491 cat 4 3.1175385 8 0.63997877 cat 3 2.2297836 9 0.97096661 cat 2 1.5898048 10 0.61883821 cat 1 0.6188382
这篇关于如何在R中优化sapply以计算数据帧上的运行总计的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!