对R中的多个变量进行分组 [英] Grouping on multiple variables in R

查看:747
本文介绍了对R中的多个变量进行分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是一个强大的数据透视表用户,他强迫自己学习R.我知道如何在excel中做这个分析,但是找不出在R中编写正确的方法。



我正在尝试通过2个不同的变量对用户数据进行分组,同时将变量分组到范围(或仓),然后汇总其他变量。



这是数据的样子:

 收入
1 25 0 25
2 2 2 0
3 86 7 8
4 128 24 94
5 30 5 18
... ... ... ...
280000 80 10 100
280001 42 4 25
280002 31 8 17

这是我想让输出如下所示:

  VisitRange PostRange用户数总收入平均收入
0 0 XYZ
1-10 0 XYZ
11-20 0 XYZ
21-30 0 XYZ
31-40 0 XYZ
41-50 0 XYZ
> 50 0 XYZ
0 1-10 XYZ
1-10 1-10 XYZ
11-20 1-10 XYZ
21-30 1-10 XYZ
31 -40 1-10 XYZ
41-50 1-10 XYZ
> 50 1-10 XYZ

想通过访问和帖子分组10到一定水平,然后将50以上的任何组合分组为> 51



我已经看到了一个自由和直观的方式来完成这个,但我不认为他们会这样工作我期待,但我可能是错的。



最后,我知道我可以在SQL中使用和if / then语句来确定访问范围和职位范围(例如 - 如果1之间的访问和10,那么'1-10'),那么只是按访问范围和职位范围,但我的目标是开始强迫自己使用R.也许R在这里不是正确的工具,但我认为是...



所有的帮助将不胜感激。提前致谢。

解决方案

plyr 包中的成语和 ddply 特别是与Excel中的数据透视表非常相似。



在你的例子中,你唯一需要做的就是 cut 将您的分组变量转换为所需的休息,然后传递到 ddply 。以下是一个示例:



首先,创建一些示例数据:

  set.seed(1)
dat < - data.frame(
userid = 1:500,
visits = sample(0:50,500,replace = TRUE),
posts = sample(0:50,500,replace = TRUE),
revenue = sample(1:100,replace = TRUE)

/ pre>

现在,使用 cut 将您的分组变量分成所需的范围:

  dat $ PostRange<  -  cut(dat $ posts,breaks = seq(0,50,10),include.lowest = TRUE)
dat $ VisitRange< - cut(dat $ visits,breaks = seq(0,50,10),include.lowest = TRUE)

最后,使用 ddply 总结

  library(plyr)
ddply(dat,。(VisitRange,PostRange),
summaryize,
Users = length ),
`总收入=总和(收入),
`平均收入=平均(收入))

结果:

  VisitRange PostRange用户总收入平均收入
1 [0,10] [0,10] 23 1318 57.30435
2 [0,10] (10,20)23 1136 49.39130
3 [0,10](20,30)28 1499 53.53571
4 [0,10](30,40)20 923 46.15000
5 [ 0,10](40,50] 14 826 59.00000
6(10,20)[0,10] 23 1227 53.34783
7(10,20)(10,20)17 642 37.76471
8(10,20)(20,30)20 888 44.40000
9(10,20)(30,40)15 622 41.46667
10(10,20)(40,50] 21 968 46.09524
11(20,30] [0,10] 23 1226 53.30435
12(20,30](10,20)19 1021 53.73684
13(20,30)(20 ,30] 23 1380 60.00000
14(20,30)(30,40)8 313 39.12500
15(20 ,30](40,50] 19 710 37.36842
16(30,40] [0,10] 18 782 43.44444
17(30,40)(10,20)25 1308 52.32000
18(30,40)(20,30)14 553 39.50000
19(30,40)(30,40)26 1131 43.50000
20(30,40)(40,50] 20 1295 64.75000
21(40,50] [0,10] 20 958 47.90000
22(40,50](10,20)21 1168 55.61905
23(40,50)(20, 30] 20 1118 55.90000
24(40,50](30,40)20 1009 50.45000
25(40,50)(40,50] 20 934 46.70000


I'm a power excel pivot table user who is forcing himself to learn R. I know exactly how to do this analysis in excel, but can't figure out the right way to code this in R.

I'm trying to group user data by 2 different variables, while grouping the variables into ranges (or bins), then summarizing other variables.

Here is what the data looks like:

userid  visits  posts   revenue
1       25      0       25
2       2       2       0
3       86      7       8
4       128     24      94
5       30      5       18
…       …       …        …
280000  80      10      100
280001  42      4       25
280002  31      8       17

Here is what I am trying to get the output to look like:

VisitRange  PostRange   # of Users  Total Revenue   Average Revenue
0           0           X            Y              Z
1-10        0           X   Y   Z
11-20       0           X   Y   Z
21-30       0           X   Y   Z
31-40       0           X   Y   Z
41-50       0           X   Y   Z
> 50        0           X   Y   Z
0           1-10        X            Y              Z
1-10        1-10        X            Y              Z
11-20       1-10        X            Y              Z
21-30       1-10        X            Y              Z
31-40       1-10        X            Y              Z
41-50       1-10        X            Y              Z
> 50        1-10        X            Y              Z

want to group by visits and posts by 10 up to a certain level, then group anything higher than 50 as '> 51'

I've looked a tapply and ddply as ways to accomplish this, but I don't think they will work the way I am expecting, but I could be wrong.

Lastly, I know I could do this in SQL using and if/then statement to identify the range of visits and the range of posts (for example - If visits between 1 and 10, then '1-10'), then just group by visit range and post range, but my goal here is to start forcing myself to use R. Maybe R isn't the right tool here, but I think it is…

All help would be appreciated. Thanks in advance.

解决方案

The idiom in the plyr package and ddply in particular, is very similar to pivot tables in Excel.

In your example, the only thing you need to do is the cut your grouping variables into the desired breaks, before passing to ddply. Here is an example:

First, create some sample data:

set.seed(1)
dat <- data.frame(
  userid = 1:500,
  visits =sample(0:50, 500, replace=TRUE),
  posts = sample(0:50, 500, replace=TRUE),
  revenue = sample(1:100, replace=TRUE)
  )

Now, use cut to divide your grouping variables into the desired ranges:

dat$PostRange <- cut(dat$posts, breaks=seq(0, 50, 10), include.lowest=TRUE)
dat$VisitRange <- cut(dat$visits, breaks=seq(0, 50, 10), include.lowest=TRUE)

Finally, use ddply with summarise:

library(plyr)
ddply(dat, .(VisitRange, PostRange), 
      summarise, 
      Users=length(userid), 
      `Total Revenue`=sum(revenue), 
      `Average Revenue`=mean(revenue))

The results:

   VisitRange PostRange Users Total Revenue Average Revenue
1      [0,10]    [0,10]    23          1318        57.30435
2      [0,10]   (10,20]    23          1136        49.39130
3      [0,10]   (20,30]    28          1499        53.53571
4      [0,10]   (30,40]    20           923        46.15000
5      [0,10]   (40,50]    14           826        59.00000
6     (10,20]    [0,10]    23          1227        53.34783
7     (10,20]   (10,20]    17           642        37.76471
8     (10,20]   (20,30]    20           888        44.40000
9     (10,20]   (30,40]    15           622        41.46667
10    (10,20]   (40,50]    21           968        46.09524
11    (20,30]    [0,10]    23          1226        53.30435
12    (20,30]   (10,20]    19          1021        53.73684
13    (20,30]   (20,30]    23          1380        60.00000
14    (20,30]   (30,40]     8           313        39.12500
15    (20,30]   (40,50]    19           710        37.36842
16    (30,40]    [0,10]    18           782        43.44444
17    (30,40]   (10,20]    25          1308        52.32000
18    (30,40]   (20,30]    14           553        39.50000
19    (30,40]   (30,40]    26          1131        43.50000
20    (30,40]   (40,50]    20          1295        64.75000
21    (40,50]    [0,10]    20           958        47.90000
22    (40,50]   (10,20]    21          1168        55.61905
23    (40,50]   (20,30]    20          1118        55.90000
24    (40,50]   (30,40]    20          1009        50.45000
25    (40,50]   (40,50]    20           934        46.70000

这篇关于对R中的多个变量进行分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆