Grouping on multiple variables in R
Problem description
I'm an Excel pivot table power user who is forcing himself to learn R. I know exactly how to do this analysis in Excel, but can't figure out the right way to code it in R.
I'm trying to group user data by 2 different variables, while grouping the variables into ranges (or bins), then summarizing other variables.
Here is what the data looks like:
    userid   visits  posts  revenue
    1        25      0      25
    2        2       2      0
    3        86      7      8
    4        128     24     94
    5        30      5      18
    …        …       …      …
    280000   80      10     100
    280001   42      4      25
    280002   31      8      17
Here is what I am trying to get the output to look like:
    VisitRange  PostRange  # of Users  Total Revenue  Average Revenue
    0           0          X           Y              Z
    1-10        0          X           Y              Z
    11-20       0          X           Y              Z
    21-30       0          X           Y              Z
    31-40       0          X           Y              Z
    41-50       0          X           Y              Z
    > 50        0          X           Y              Z
    0           1-10       X           Y              Z
    1-10        1-10       X           Y              Z
    11-20       1-10       X           Y              Z
    21-30       1-10       X           Y              Z
    31-40       1-10       X           Y              Z
    41-50       1-10       X           Y              Z
    > 50        1-10       X           Y              Z
I want to group visits and posts in bins of 10 up to a certain level, then group anything higher than 50 as '> 50'.
I've looked at tapply and ddply as ways to accomplish this, but I don't think they will work the way I am expecting. I could be wrong, though.
Lastly, I know I could do this in SQL using an if/then statement to identify the range of visits and the range of posts (for example: if visits is between 1 and 10, then '1-10'), then just group by visit range and post range. But my goal here is to start forcing myself to use R. Maybe R isn't the right tool here, but I think it is…
All help would be appreciated. Thanks in advance.
Solution

The idiom in the plyr package, and ddply in particular, is very similar to pivot tables in Excel.

In your example, the only thing you need to do is cut your grouping variables into the desired breaks before passing them to ddply. Here is an example:

First, create some sample data:

    set.seed(1)
    dat <- data.frame(
      userid  = 1:500,
      visits  = sample(0:50, 500, replace = TRUE),
      posts   = sample(0:50, 500, replace = TRUE),
      # 100 draws; data.frame() recycles them to fill the 500 rows
      revenue = sample(1:100, replace = TRUE)
    )
Now, use cut to divide your grouping variables into the desired ranges:

    dat$PostRange  <- cut(dat$posts,  breaks = seq(0, 50, 10), include.lowest = TRUE)
    dat$VisitRange <- cut(dat$visits, breaks = seq(0, 50, 10), include.lowest = TRUE)
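Note that breaks = seq(0, 50, 10) produces intervals such as [0,10] and (10,20], and any value above 50 falls outside the breaks and becomes NA. The question asks for a separate "0" bin and a "> 50" bin instead. A minimal sketch of one way to get those labels, assuming visits and posts are non-negative whole numbers (the vector names and sample values below are illustrative and not part of the original answer):

```r
# Hypothetical sample values chosen to hit every bin, including 0 and > 50.
visits <- c(0, 1, 10, 11, 25, 40, 50, 51, 128)

# cut() uses right-closed intervals by default:
# (-Inf,0], (0,10], (10,20], (20,30], (30,40], (40,50], (50,Inf]
range_breaks <- c(-Inf, 0, 10, 20, 30, 40, 50, Inf)
range_labels <- c("0", "1-10", "11-20", "21-30", "31-40", "41-50", "> 50")

# One label per interval, so length(range_labels) == length(range_breaks) - 1.
visit_range <- cut(visits, breaks = range_breaks, labels = range_labels)

print(data.frame(visits, visit_range))
```

The same breaks and labels could be reused for posts. Negative inputs, if they ever occurred, would land in the "0" bin under this sketch, so it relies on the counts being non-negative.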
Finally, use ddply with summarise:

    library(plyr)
    ddply(dat, .(VisitRange, PostRange), summarise,
          Users = length(userid),
          `Total Revenue` = sum(revenue),
          `Average Revenue` = mean(revenue))
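For comparison, the same kind of grouped summary can be sketched in base R with aggregate, which avoids the plyr dependency. This is an alternative, not the method used in the answer. The small data frame below is hypothetical (it is not the sample data above), and its ranges are assumed to be already binned:

```r
# Small hypothetical data set with pre-binned ranges.
dat_small <- data.frame(
  userid     = 1:6,
  VisitRange = c("0", "0", "1-10", "1-10", "1-10", "> 50"),
  PostRange  = c("0", "0", "0", "1-10", "1-10", "1-10"),
  revenue    = c(10, 20, 5, 30, 50, 100)
)

# Compute three statistics per (VisitRange, PostRange) combination.
summary_tbl <- aggregate(
  revenue ~ VisitRange + PostRange,
  data = dat_small,
  FUN  = function(x) c(Users = length(x), Total = sum(x), Average = mean(x))
)

# aggregate() stores the three statistics in a single matrix column;
# do.call(data.frame, ...) spreads them into ordinary columns.
summary_tbl <- do.call(data.frame, summary_tbl)

print(summary_tbl)
```

Only combinations that actually occur in the data appear in the result, so empty bins are not listed.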
The results:

       VisitRange PostRange Users Total Revenue Average Revenue
    1      [0,10]    [0,10]    23          1318        57.30435
    2      [0,10]   (10,20]    23          1136        49.39130
    3      [0,10]   (20,30]    28          1499        53.53571
    4      [0,10]   (30,40]    20           923        46.15000
    5      [0,10]   (40,50]    14           826        59.00000
    6     (10,20]    [0,10]    23          1227        53.34783
    7     (10,20]   (10,20]    17           642        37.76471
    8     (10,20]   (20,30]    20           888        44.40000
    9     (10,20]   (30,40]    15           622        41.46667
    10    (10,20]   (40,50]    21           968        46.09524
    11    (20,30]    [0,10]    23          1226        53.30435
    12    (20,30]   (10,20]    19          1021        53.73684
    13    (20,30]   (20,30]    23          1380        60.00000
    14    (20,30]   (30,40]     8           313        39.12500
    15    (20,30]   (40,50]    19           710        37.36842
    16    (30,40]    [0,10]    18           782        43.44444
    17    (30,40]   (10,20]    25          1308        52.32000
    18    (30,40]   (20,30]    14           553        39.50000
    19    (30,40]   (30,40]    26          1131        43.50000
    20    (30,40]   (40,50]    20          1295        64.75000
    21    (40,50]    [0,10]    20           958        47.90000
    22    (40,50]   (10,20]    21          1168        55.61905
    23    (40,50]   (20,30]    20          1118        55.90000
    24    (40,50]   (30,40]    20          1009        50.45000
    25    (40,50]   (40,50]    20           934        46.70000