r cumsum per group in dplyr
Problem description
I am starting to enjoy dplyr, but I got stuck on a use case. I want to be able to apply cumsum per group in a dataframe with the package, but I can't seem to get it right.

For a demo dataframe I've generated the following data:
    set.seed(123)
    len = 10
    dates = as.Date('2014-01-01') + 1:len
    grp_a = data.frame(dates=dates, group='A', sales=rnorm(len))
    grp_b = data.frame(dates=dates, group='B', sales=rnorm(len))
    grp_c = data.frame(dates=dates, group='C', sales=rnorm(len))
    df = rbind(grp_a, grp_b, grp_c)
This creates a dataframe that looks like this (the printout shows five rows per group, which corresponds to len = 5):

            dates group       sales
    1  2014-01-02     A -0.56047565
    2  2014-01-03     A -0.23017749
    3  2014-01-04     A  1.55870831
    4  2014-01-05     A  0.07050839
    5  2014-01-06     A  0.12928774
    6  2014-01-02     B  1.71506499
    7  2014-01-03     B  0.46091621
    8  2014-01-04     B -1.26506123
    9  2014-01-05     B -0.68685285
    10 2014-01-06     B -0.44566197
    11 2014-01-02     C  1.22408180
    12 2014-01-03     C  0.35981383
    13 2014-01-04     C  0.40077145
    14 2014-01-05     C  0.11068272
    15 2014-01-06     C -0.55584113
I then go on to create a dataframe for plotting, but with a for loop that I'd like to replace with something cleaner.
    library(dplyr)  # provides filter(), arrange() and %>%

    # start from an empty dataframe and append one group at a time
    pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
    for(grp in unique(df$group)){
      subs = filter(df, group == grp) %>% arrange(dates)
      pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=cumsum(subs$sales)))
    }
I use this for the plot:

    library(ggplot2)

    p = ggplot()
    p = p + geom_line(data=pdf, aes(dates, sales, colour=group))
    p + ggtitle("sales per group")
Is there a better way (one that uses the dplyr methods) to create this dataframe? I've looked at the summarize method, but this seems to aggregate a group from N items -> 1 item. This use case seems to break my dplyr flow at the moment. Any suggestions for a better approach?

Solution

Ah. After fiddling around I seem to have found it.
    pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))
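To see why mutate works here where summarize did not, the following minimal sketch contrasts the two on a tiny hand-made dataframe. The toy data and the names toy, totals and running are assumptions made up for illustration; they are not part of the original post.

```r
library(dplyr)

# Hypothetical toy data: two groups, three dates each
toy <- data.frame(
  dates = rep(as.Date("2014-01-01") + 1:3, times = 2),
  group = rep(c("A", "B"), each = 3),
  sales = c(1, 2, 3, 10, 20, 30)
)

# summarise() collapses each group to a single row (N items -> 1 item)
totals <- toy %>%
  group_by(group) %>%
  summarise(total = sum(sales))

# mutate() keeps every row and evaluates cumsum() within each group
running <- toy %>%
  group_by(group) %>%
  arrange(dates, .by_group = TRUE) %>%  # date order inside each group
  mutate(cs = cumsum(sales)) %>%
  ungroup()                             # drop grouping once it is no longer needed

print(totals)
print(running)
```

In this sketch running$cs comes out as 1, 3, 6 for group A and 10, 30, 60 for group B, while totals has only one row per group.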
Output with the for loop from the question:
    > pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
    > for(grp in unique(df$group)){
    +   subs = filter(df, group == grp) %>% arrange(dates)
    +   pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=subs$sales, cs=cumsum(subs$sales)))
    + }
    > pdf
            dates group       sales         cs
    1  2014-01-02     A -0.56047565 -0.5604756
    2  2014-01-03     A -0.23017749 -0.7906531
    3  2014-01-04     A  1.55870831  0.7680552
    4  2014-01-05     A  0.07050839  0.8385636
    5  2014-01-06     A  0.12928774  0.9678513
    6  2014-01-02     B  1.71506499  1.7150650
    7  2014-01-03     B  0.46091621  2.1759812
    8  2014-01-04     B -1.26506123  0.9109200
    9  2014-01-05     B -0.68685285  0.2240671
    10 2014-01-06     B -0.44566197 -0.2215949
    11 2014-01-02     C  1.22408180  1.2240818
    12 2014-01-03     C  0.35981383  1.5838956
    13 2014-01-04     C  0.40077145  1.9846671
    14 2014-01-05     C  0.11068272  2.0953498
    15 2014-01-06     C -0.55584113  1.5395087
Output with this line of code:
    > pdf = df %>% group_by(group) %>% mutate(cs = cumsum(sales))
    > pdf
    Source: local data frame [15 x 4]
    Groups: group

            dates group       sales         cs
    1  2014-01-02     A -0.56047565 -0.5604756
    2  2014-01-03     A -0.23017749 -0.7906531
    3  2014-01-04     A  1.55870831  0.7680552
    4  2014-01-05     A  0.07050839  0.8385636
    5  2014-01-06     A  0.12928774  0.9678513
    6  2014-01-02     B  1.71506499  1.7150650
    7  2014-01-03     B  0.46091621  2.1759812
    8  2014-01-04     B -1.26506123  0.9109200
    9  2014-01-05     B -0.68685285  0.2240671
    10 2014-01-06     B -0.44566197 -0.2215949
    11 2014-01-02     C  1.22408180  1.2240818
    12 2014-01-03     C  0.35981383  1.5838956
    13 2014-01-04     C  0.40077145  1.9846671
    14 2014-01-05     C  0.11068272  2.0953498
    15 2014-01-06     C -0.55584113  1.5395087
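Since the goal in the question was a plot, the grouped result can be passed straight to ggplot2 without the intermediate loop. The following is only a sketch: it assumes ggplot2 is installed, uses len = 5 to match the printed output, plots the new cs column instead of overwriting sales, and uses plot_data as a name chosen here for illustration.

```r
library(dplyr)
library(ggplot2)

# Rebuild the demo data from the question (len = 5 is an assumption
# chosen to match the 15-row printout)
set.seed(123)
len <- 5
dates <- as.Date("2014-01-01") + 1:len
df <- rbind(
  data.frame(dates = dates, group = "A", sales = rnorm(len)),
  data.frame(dates = dates, group = "B", sales = rnorm(len)),
  data.frame(dates = dates, group = "C", sales = rnorm(len))
)

# Running total of sales per group, in date order within each group
plot_data <- df %>%
  group_by(group) %>%
  arrange(dates, .by_group = TRUE) %>%
  mutate(cs = cumsum(sales)) %>%
  ungroup()

# One line per group, cumulative sales on the y axis
p <- ggplot(plot_data, aes(dates, cs, colour = group)) +
  geom_line() +
  ggtitle("sales per group")
print(p)
```

Keeping the running total in its own column leaves the raw sales values available for other layers or checks.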