R use ddply or aggregate

Problem description

I have a data frame with 3 columns: custId, saleDate, DelivDate.

> head(events22)
     custId            saleDate      DelivDate
1 280356593 2012-11-14 14:04:59 11/14/12 17:29
2 280367076 2012-11-14 17:04:44 11/14/12 20:48
3 280380097 2012-11-14 17:38:34 11/14/12 20:45
4 280380095 2012-11-14 20:45:44 11/14/12 23:59
5 280380095 2012-11-14 20:31:39 11/14/12 23:49
6 280380095 2012-11-14 19:58:32 11/15/12 00:10

Here's the dput:

> dput(events22)
structure(list(custId = c(280356593L, 280367076L, 280380097L, 
280380095L, 280380095L, 280380095L, 280364279L, 280364279L, 280398506L, 
280336395L, 280364376L, 280368458L, 280368458L, 280368456L, 280368456L, 
280364225L, 280391721L, 280353458L, 280387607L, 280387607L), 
    saleDate = structure(c(1352901899.215, 1352912684.484, 1352914714.971, 
    1352925944.429, 1352925099.247, 1352923112.636, 1352922476.55, 
    1352920666.968, 1352915226.534, 1352911135.077, 1352921349.592, 
    1352911494.975, 1352910529.86, 1352924755.295, 1352907511.476, 
    1352920108.577, 1352906160.883, 1352905925.134, 1352916810.309, 
    1352916025.673), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    DelivDate = c("11/14/12 17:29", "11/14/12 20:48", "11/14/12 20:45", 
    "11/14/12 23:59", "11/14/12 23:49", "11/15/12 00:10", "11/14/12 23:35", 
    "11/14/12 22:59", "11/14/12 20:53", "11/14/12 19:52", "11/14/12 23:01", 
    "11/14/12 19:47", "11/14/12 19:42", "11/14/12 23:31", "11/14/12 23:33", 
    "11/14/12 22:45", "11/14/12 18:11", "11/14/12 18:12", "11/14/12 19:17", 
    "11/14/12 19:19")), .Names = c("custId", "saleDate", "DelivDate"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", 
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"
), class = "data.frame")

I'm trying to find the DelivDate for the most recent saleDate for each custId.

I can do that using plyr::ddply like this:

library(plyr)

# For each custId, return the DelivDate of the row with the latest saleDate
dd1 <- ddply(events22, .(custId), .inform = TRUE, function(x) {
  x[x$saleDate == max(x$saleDate), "DelivDate"]
})

My question is whether there is a faster way to do this as the ddply method is a bit time consuming (the full data set is ~ 400k lines). I've looked at using aggregate() but don't know how to get a value other than the one I'm sorting by.

Any suggestions?

Here are the benchmark results for 10k lines @ 10 iterations:

         test replications elapsed relative user.self
2      AGG2()           10    5.96    1.000      5.93
1      AGG1()           10   20.87    3.502     20.75
5 DATATABLE()           10   61.32   10.289     60.31
3     DDPLY()           10   80.04   13.430     79.63
4    DOCALL()           10   90.43   15.173     88.39

EDIT2: Although it is the quickest, AGG2() doesn't give the correct answer.

> head(agg2)
     custId            saleDate      DelivDate
1 280336395 2012-11-14 16:38:55 11/14/12 19:52
2 280353458 2012-11-14 15:12:05 11/14/12 18:12
3 280356593 2012-11-14 14:04:59 11/14/12 17:29
4 280364225 2012-11-14 19:08:28 11/14/12 22:45
5 280364279 2012-11-14 19:47:56 11/14/12 23:35
6 280364376 2012-11-14 19:29:09 11/14/12 23:01
> agg2 <- AGG2()
> head(agg2)
     custId      DelivDate
1 280336395 11/14/12 17:29
2 280353458 11/14/12 17:29
3 280356593 11/14/12 17:29
4 280364225 11/14/12 17:29
5 280364279 11/14/12 17:29
6 280364376 11/14/12 17:29
> agg2 <- DDPLY()
> head(agg2)
     custId             V1
1 280336395 11/14/12 19:52
2 280353458 11/14/12 18:12
3 280356593 11/14/12 17:29
4 280364225 11/14/12 22:45
5 280364279 11/14/12 23:35
6 280364376 11/14/12 23:01

Recommended answer

I, too, would recommend data.table here, but since you asked for an aggregate solution, here is one which combines aggregate and merge to get all the columns:

merge(events22, aggregate(saleDate ~ custId, events22, max))

Or just aggregate if you only want the "custId" and "DelivDate" columns:

aggregate(list(DelivDate = events22$saleDate), 
          list(custId = events22$custId),
          function(x) events22[["DelivDate"]][which.max(x)])
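Note that inside the anonymous function, which.max(x) is a position within one customer's group, but it is used to index the full DelivDate column. That appears to be why EDIT2 in the question reports the same DelivDate for every customer. One possible correction, offered only as a minimal sketch, is to aggregate over row numbers so that the within-group position can be mapped back to a row of the whole data frame. The data frame ev and the names latest and row are made up for illustration; ev just mimics the columns of events22.

```r
# Small illustrative data frame (hypothetical values, same column names as events22)
ev <- data.frame(
  custId    = c(1L, 2L, 2L, 3L, 3L, 3L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 17:00:00",
                           "2012-11-14 18:00:00", "2012-11-14 20:45:00",
                           "2012-11-14 20:30:00", "2012-11-14 19:55:00"),
                         tz = "UTC"),
  DelivDate = c("d1", "d2", "d3", "d4", "d5", "d6"),
  stringsAsFactors = FALSE
)

# Aggregate over row numbers: `i` holds the row numbers of one custId group,
# which.max() finds the latest sale within that group, and i[...] turns the
# within-group position back into a row number of the full data frame.
latest <- aggregate(list(row = seq_len(nrow(ev))),
                    list(custId = ev$custId),
                    function(i) i[which.max(ev$saleDate[i])])

# Look up DelivDate by the recovered row numbers
latest$DelivDate <- ev$DelivDate[latest$row]
latest[, c("custId", "DelivDate")]
```

With real data, ev would be replaced by events22. If a customer has two sales with the same latest saleDate, which.max() keeps the first one it encounters.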

Finally, here's an option using sqldf:

library(sqldf)
sqldf("select custId, DelivDate, max(saleDate) `saleDate` 
      from events22 group by custId")


Benchmarks

I'm not a benchmarking or data.table expert, but it surprised me that data.table is not faster here. My suspicion is that the results would be quite different on a larger dataset, say for instance, your 400k lines one. Anyway, here's some benchmarking code modeled after @mnel's answer here so you can do some tests on your actual dataset for future reference.

library(rbenchmark)

First, set up your functions for what you want to benchmark.

library(plyr)
library(data.table)
library(sqldf)

# data.table copy used by DATATABLE(); `dt` is not defined in the original
# post, so this definition is an assumption
dt <- as.data.table(events22)

DDPLY <- function() { 
  x <- ddply(events22, .(custId), .inform = TRUE, 
             function(x) {
               x[x$saleDate == max(x$saleDate),"DelivDate"]}) 
}
DATATABLE <- function() { x <- dt[, .SD[which.max(saleDate), ], by = custId] }
AGG1 <- function() { 
  x <- merge(events22, aggregate(saleDate ~ custId, events22, max)) }
AGG2 <- function() { 
  x <- aggregate(list(DelivDate = events22$saleDate), 
                 list(custId = events22$custId),
                 function(x) events22[["DelivDate"]][which.max(x)]) }
SQLDF <- function() { 
  x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate` 
             from events22 group by custId") }
DOCALL <- function() {
  do.call(rbind, 
          lapply(split(events22, events22$custId), function(x){
            x[which.max(x$saleDate), ]
          })
  )
}

Second, run the benchmark.

benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(), 
          order = "elapsed")[1:5]
#          test replications elapsed relative user.self
# 4      AGG2()          100   0.285    1.000     0.284
# 3      AGG1()          100   0.891    3.126     0.896
# 6    DOCALL()          100   1.202    4.218     1.204
# 2 DATATABLE()          100   1.251    4.389     1.248
# 1     DDPLY()          100   1.254    4.400     1.252
# 5     SQLDF()          100   2.109    7.400     2.108
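
Another base R candidate that could be wrapped in a function and added to the benchmark() call is to sort once and then keep the last row per customer. This is only a sketch and is not one of the approaches timed above, so no timing is claimed for it. The data frame ev and the names sorted and latest are hypothetical; ev mimics the columns of events22.

```r
# Small illustrative data frame (hypothetical values, same column names as events22)
ev <- data.frame(
  custId    = c(1L, 2L, 2L, 3L, 3L, 3L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 17:00:00",
                           "2012-11-14 18:00:00", "2012-11-14 20:45:00",
                           "2012-11-14 20:30:00", "2012-11-14 19:55:00"),
                         tz = "UTC"),
  DelivDate = c("d1", "d2", "d3", "d4", "d5", "d6"),
  stringsAsFactors = FALSE
)

# Sort so that, within each custId, the most recent saleDate comes last
sorted <- ev[order(ev$custId, ev$saleDate), ]

# duplicated(fromLast = TRUE) flags every row except the last one of each
# custId, so negating it keeps exactly the most recent sale per customer
latest <- sorted[!duplicated(sorted$custId, fromLast = TRUE),
                 c("custId", "DelivDate")]
latest
```

This avoids calling a function once per group, which is usually where split-apply approaches spend their time, but whether it actually beats the alternatives on the 400k-row data set would need to be measured.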
