How to use dplyr as an alternative to aggregate
Question
I have a data frame, times, that looks like this:
user time
A 7/7/2010
B 7/12/2010
C 7/12/2010
A 7/12/2010
C 7/15/2010
I'm using aggregate(time ~ user, times, function(x) sort(as.vector(x))) to get this:
user time
A c(7/7/2010, 7/12/2010)
B c(7/12/2010)
C c(7/12/2010, 7/15/2010)
The problem is that I have over 20 million entries in times, so aggregate is taking over 4 hours. Is there an alternative using dplyr that will get me the sorted vector of dates for each user?
Recommended answer
Updated answer: Based on your comments, you could summarise each user's dates directly:
library(dplyr)
# Data (with a few additions)
times = read.table(text="user time
A 7/7/2010
B 7/12/2010
B 7/13/2010
C 7/12/2010
A 7/12/2010
A 7/13/2010
C 7/15/2010", header=TRUE, stringsAsFactors=FALSE)
times$time = as.Date(times$time, "%m/%d/%Y")
times
user time
1 A 2010-07-07
2 B 2010-07-12
3 B 2010-07-13
4 C 2010-07-12
5 A 2010-07-12
6 A 2010-07-13
7 C 2010-07-15
times %>% group_by(user) %>%
summarise(First=min(time),
Last=max(time),
N = n(),
minDiff=min(diff(time)),
meanDiff=mean(diff(time)),
NumDiffUniq = length(unique(diff(time))))
user First Last N minDiff meanDiff NumDiffUniq
1 A 2010-07-07 2010-07-13 3 1 days 3 days 2
2 B 2010-07-12 2010-07-13 2 1 days 1 days 1
3 C 2010-07-12 2010-07-15 2 3 days 3 days 1
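Note that the diff() calls above give meaningful gaps only if each user's rows are already in date order, which happens to be true for the sample data. A minimal sketch of a variant that sorts inside each group first is shown here; the unsorted sample data and the name gaps are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Hypothetical, deliberately unsorted sample data (not from the question)
times <- data.frame(
  user = c("A", "A", "A", "B", "B"),
  time = as.Date(c("2010-07-13", "2010-07-07", "2010-07-12",
                   "2010-07-13", "2010-07-12")),
  stringsAsFactors = FALSE
)

# Sort the dates within each user before differencing, so the gaps
# are always between consecutive dates regardless of row order
gaps <- times %>%
  group_by(user) %>%
  summarise(minDiff  = min(diff(sort(time))),
            meanDiff = mean(diff(sort(time))))

gaps
```

A user with only one row has no gaps, so min(diff(...)) would warn and return Inf for that group; you may want to filter such users out first.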
Original answer:
I'm not clear on what you're trying to accomplish. If you just want your data frame to be sorted, then with dplyr you would do:
library(dplyr)
times.sorted = times %>% arrange(user, time)
If you want time to become a single string of dates for each user, then you could do:
times.summary = times %>% group_by(user) %>%
summarise(time = paste(time, collapse=","))
But note that for each user this results in one comma-separated string containing the dates, not a vector of dates.
times.summary
user time
1 A 7/7/2010,7/12/2010
2 B 7/12/2010
3 C 7/12/2010,7/15/2010
If you actually want each cell to be a vector of dates, you could make each cell a list (though there might be a better way). For example:
times.new = times %>% group_by(user) %>%
summarise(time = list(as.vector(time)))
times.new$time
[[1]]
[1] "7/7/2010" "7/12/2010"
[[2]]
[1] "7/12/2010"
[[3]]
[1] "7/12/2010" "7/15/2010"
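The list-column example above keeps the dates as character strings in their original row order. To get closer to what the question asks for (a sorted vector of dates per user), one possible sketch is to convert to Date first and sort inside summarise(). The name times.sorted.list is made up for illustration.

```r
library(dplyr)

# Sample data from the question
times <- read.table(text = "user time
A 7/7/2010
B 7/12/2010
C 7/12/2010
A 7/12/2010
C 7/15/2010", header = TRUE, stringsAsFactors = FALSE)

# Convert to Date so that sort() orders chronologically rather than
# alphabetically (as strings, "7/12/2010" sorts before "7/7/2010")
times$time <- as.Date(times$time, "%m/%d/%Y")

# One row per user; each cell of `time` holds that user's sorted dates
times.sorted.list <- times %>%
  group_by(user) %>%
  summarise(time = list(sort(time)))

times.sorted.list$time[[1]]  # sorted Date vector for user A
```

Whether this is fast enough on 20 million rows is not something this sketch measures; it only shows the shape of the result.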
But if your goal is to analyze your data by group, then you don't actually need to do any of the above. You can use base, dplyr, or data.table functions to perform any analysis by group without first sorting your data.
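As a base R illustration of working by group, here is a rough sketch that orders the whole data frame once and then splits the dates by user, which yields a named list of sorted Date vectors. The names ord and by.user are made up for illustration, and this has not been benchmarked against aggregate on data of the size described in the question.

```r
# Sample data from the question, with time converted to Date
times <- data.frame(
  user = c("A", "B", "C", "A", "C"),
  time = as.Date(c("7/7/2010", "7/12/2010", "7/12/2010",
                   "7/12/2010", "7/15/2010"), "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# Order once by user and date, then split the dates by user;
# each list element is already sorted because of the single global sort
ord <- order(times$user, times$time)
by.user <- split(times$time[ord], times$user[ord])

by.user$C  # sorted Date vector for user C
```

Sorting once up front avoids calling sort() separately for every user, which may matter when there are many small groups.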