How to use dplyr as an alternative to aggregate

Question

I have a dataframe times that looks like this:

user     time
A        7/7/2010
B        7/12/2010
C        7/12/2010
A        7/12/2010 
C        7/15/2010

I'm using aggregate(time ~ user, times, function(x) sort(as.vector(x))) to get this:

user     time
A        c(7/7/2010, 7/12/2010)
B        c(7/12/2010)
C        c(7/12/2010, 7/15/2010)

The problem is that I have over 20 million entries in times, so aggregate is taking over 4 hours. Is there any alternative using dplyr that will get me the sorted vector of dates?

Recommended answer

Updated answer: Based on your comments,

library(dplyr)

# Data (with a few additions)
times = read.table(text="user     time
A        7/7/2010
B        7/12/2010
B 7/13/2010
C        7/12/2010
A        7/12/2010 
A 7/13/2010
C        7/15/2010", header=TRUE, stringsAsFactors=FALSE)

times$time = as.Date(times$time, "%m/%d/%Y")

times

  user       time
1    A 2010-07-07
2    B 2010-07-12
3    B 2010-07-13
4    C 2010-07-12
5    A 2010-07-12
6    A 2010-07-13
7    C 2010-07-15




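# Sort by user and date first so that diff() measures gaps between consecutive dates
times %>% arrange(user, time) %>% group_by(user) %>%
  summarise(First=min(time),
            Last=max(time),
            N = n(),
            minDiff=min(diff(time)),
            meanDiff=mean(diff(time)),
            NumDiffUniq = length(unique(diff(time))))
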
   user      First       Last     N        minDiff       meanDiff NumDiffUniq
1     A 2010-07-07 2010-07-13     3         1 days         3 days           2
2     B 2010-07-12 2010-07-13     2         1 days         1 days           1
3     C 2010-07-12 2010-07-15     2         3 days         3 days           1
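
The minDiff and meanDiff columns rely on diff(), which subtracts neighbouring elements in whatever order they appear. A minimal base R sketch (the vector x is made up for illustration and is not part of the original data) suggests why the dates within each user should be in chronological order before taking differences:

```r
# Three dates deliberately given out of chronological order (illustrative data)
x <- as.Date(c("2010-07-12", "2010-07-07", "2010-07-13"))

# Differences follow the input order, so a gap can come out negative
unsorted_gaps <- diff(x)

# Sorting first gives the gaps between consecutive dates
sorted_gaps <- diff(sort(x))

unsorted_gaps
sorted_gaps
```

If the rows are already ordered by date within each user, as in the sample data above, the extra sort changes nothing.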


Original answer:

I'm not clear on what you're trying to accomplish. If you just want your data frame to be sorted, then with dplyr you would do:

library(dplyr)

times.sorted = times %>% arrange(user, time)

If you want time to become a string of dates for each user, then you could do:

times.summary = times %>% group_by(user) %>%
  summarise(time = paste(time, collapse=","))

But note that for each user this will result in a single string containing the dates.

times.summary

   user                time
1     A  7/7/2010,7/12/2010
2     B           7/12/2010
3     C 7/12/2010,7/15/2010


If you actually want each cell to be a vector of dates, you could make each cell a list (though there might be a better way). For example:

times.new = times %>% group_by(user) %>%
  summarise(time = list(as.vector(time)))

times.new$time

[[1]]
[1] "7/7/2010"  "7/12/2010"

[[2]]
[1] "7/12/2010"

[[3]]
[1] "7/12/2010" "7/15/2010"
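
To get the sorted vector of dates that the question asks for, one possible variation is to convert the column to Date and sort inside the list. The sketch below rebuilds the question's sample data so that it runs on its own; sorted_times is a name chosen for this example, and whether this is faster than aggregate on 20 million rows has not been measured here.

```r
library(dplyr)

# Sample data from the question, with time parsed as Date so it sorts chronologically
times <- data.frame(
  user = c("A", "B", "C", "A", "C"),
  time = as.Date(c("7/7/2010", "7/12/2010", "7/12/2010", "7/12/2010", "7/15/2010"),
                 format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# One row per user; each cell of the list column holds that user's sorted dates
sorted_times <- times %>%
  group_by(user) %>%
  summarise(time = list(sort(time)))

sorted_times$time[[1]]
```

Sorting character dates such as "7/12/2010" would order them alphabetically rather than chronologically, which is why the conversion to Date comes first.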


But if your goal is to analyze your data by group, then you don't actually need to do any of the above. You can use base, dplyr, or data.table functions to perform any analysis by group without first sorting your data.
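
As a rough illustration of the by-group point, here is a base R sketch that computes the number of days between each user's first and last date with tapply(), without sorting anything first. The data frame repeats the question's sample rows, and span_days is a name chosen for this example.

```r
# Sample data from the question, with time parsed as Date
times <- data.frame(
  user = c("A", "B", "C", "A", "C"),
  time = as.Date(c("7/7/2010", "7/12/2010", "7/12/2010", "7/12/2010", "7/15/2010"),
                 format = "%m/%d/%Y")
)

# Days between each user's first and last date; row order does not matter
span_days <- tapply(as.numeric(times$time), times$user,
                    function(x) max(x) - min(x))

span_days
```

Summaries such as min, max, and counts do not depend on row order, so only order-sensitive steps like diff() need the data sorted.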
