How do I optimize sapply in R to calculate running totals on a dataframe


Problem description

I wrote a function in R to calculate cumulative totals by month number, but the execution time of my method grows much faster than linearly as the dataset gets larger. I'm a novice R programmer; can you help me make this more efficient?
Here is the function and the way I invoke it:

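# Sum the measurements for the same subject in all months up to and
# including the month of record number recordnum.
accumulate <- function(recordnum, df) {
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))
# One call per row; each call scans the whole data frame.
system.time(df$cumulative <- sapply(seq_len(datalength), accumulate, df))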

The input dataframe:

> df
   measurement subject month
1    0.4577418     dog     5
2    0.7191123     dog     4
3    0.9346722     dog     3
4    0.2554288     dog     2
5    0.4622928     dog     1
6    0.9400145     cat     5
7    0.9782264     cat     4
8    0.1174874     cat     3
9    0.4749971     cat     2
10   0.5603327     cat     1

The output dataframe:

> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648

Notice that the cumulative column shows the accumulation of all measurements for a subject up to and including the current month. The function does not require the dataframe to be sorted. With datalength equal to 100, the elapsed time is 0.3 seconds; with 1,000 it is 0.58 seconds; with 10,000 it is 27.72 seconds. I need this to run for 200K+ records.
Thanks!
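
One way to read these timings, offered as a rough sketch rather than a measurement: each call to accumulate() compares every row of df against a single record, and sapply makes one call per row, so the number of element comparisons grows with the square of the row count. The helper name comparisons below is purely illustrative and is not part of the original post.

```r
# Rough cost model (an assumption for illustration, not a benchmark):
# each accumulate() call does two vectorized comparisons over all n rows
# (one on subject, one on month), and sapply() makes n such calls.
comparisons <- function(n) 2 * n^2

sizes <- c(100, 1000, 10000, 200000)
data.frame(rows = sizes, comparisons = comparisons(sizes))
```

Under this model, going from 10,000 to 200,000 rows multiplies the work by 400, which suggests replacing the per-row scan rather than tuning it.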

Solution

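This sorts the rows by subject and month, takes a cumulative sum of measurement within each subject using ave, and then restores the original row order. It is non-destructive, i.e. the original df is not modified, and no packages are used. If preserving the original order of the rows of df is not important, [order(o), ] on the last line can be omitted.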

o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]

giving:

   measurement subject month cumulative
1   0.37955924     dog     5  2.2580530
2   0.43577158     dog     4  1.8784938
3   0.03743103     dog     3  1.4427222
4   0.97353991     dog     2  1.4052912
5   0.43175125     dog     1  0.4317512
6   0.95757660     cat     5  4.0751151
7   0.88775491     cat     4  3.1175385
8   0.63997877     cat     3  2.2297836
9   0.97096661     cat     2  1.5898048
10  0.61883821     cat     1  0.6188382
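
As a check on the approach above, the following sketch rebuilds the seeded data from the question and compares the sorted cumsum result with the question's row-by-row function. It also shows why indexing by order(o) restores the original row order. One caveat, which is an assumption about the data rather than something stated in the original post: a plain cumsum matches accumulate() only when each subject has at most one row per month. The names slow, fast, sorted, ties and the helper running_total are hypothetical and introduced here only for illustration.

```r
# Rebuild the question's example data (seed 42, as in the question).
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# The question's row-by-row function, used here only as a reference result.
accumulate <- function(recordnum, df) {
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}
slow <- sapply(seq_len(nrow(df)), accumulate, df)

# The answer's approach: sort, take a per-subject cumsum, then undo the sort.
o <- order(df$subject, df$month)
sorted <- transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))
fast <- sorted[order(o), ]

# order(o) is the inverse permutation of o, so it puts the rows back in place.
stopifnot(identical(o[order(o)], seq_along(o)))
stopifnot(isTRUE(all.equal(fast$cumulative, slow)))

# Hypothetical tie-aware helper (not from the original answer): when a subject
# has several rows in the same month, each of them gets the total through that
# month, which is what accumulate() returns.
running_total <- function(df) {
  o <- order(df$subject, df$month)
  s <- df[o, ]
  total <- ave(s$measurement, s$subject, FUN = cumsum)
  # Within each subject/month group, keep the last cumulative value.
  total <- ave(total, s$subject, s$month, FUN = function(x) x[length(x)])
  total[order(o)]
}

# Small made-up data set with two "dog" rows in month 1.
ties <- data.frame(measurement = c(1, 2, 4, 8),
                   subject = c("dog", "dog", "dog", "cat"),
                   month = c(1, 1, 2, 1))
stopifnot(isTRUE(all.equal(running_total(ties),
                           sapply(seq_len(nrow(ties)), accumulate, ties))))
```

If months are unique within each subject, the shorter two-line version is enough; the tie-aware helper only matters when months repeat.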
