按组进行数据帧聚合 - 按照范围分隔列的值 [英] Dataframe Aggregation By Group - Separating a Column's Values by Ranges

查看:151
本文介绍了按组进行数据帧聚合 - 按照范围分隔列的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框如下:

 父<  -  c('a','b','c' ,'d',
'e','f','g','h',
'i','j','k','l',
' m','n','o','p',
'q','r','s','t',
'u','v' ,'x',
'y','z')
child< - c('A','B','C','D',
'E' 'F','G','H',
'I','J','K','L',
'M','N','O' ',
'Q','R','S','T',
'U','V','W','X',
' 'Z')
键入< - c('desktop','desktop','desktop','desktop',
'desktop','desktop','desktop','desktop',
'desktop','desktop','desktop','desktop',
'desktop','deskt 桌面,桌面,桌面,桌面,桌面,桌面,桌面,桌面,桌面 ,
'desktop','desktop')
Size< - c('MEDIUM','MEDIUM','LARGE','LARGE',
'SMALL','MEDIUM' 大,小,小,大,大,小,小,大,大 $ b'MEDIUM','SMALL','SMALL','MEDIUM',
'LARGE','MEDIUM','SMALL','MEDIUM',
'LARGE','MEDIUM'
收入< - c(22138.16,18617.94,12394.36,10535.76,
8901.41,7320.17,3821.40,2811.50,
2483.10,2145.76,2138.41,2037.67,
1950.52,1837.93,1737.68 ,1554.61,
1374.40,1334.02,1214.60,1191.41,
1189.56,1174.55,1162.80,1131.29,
1127.05,1108.53)
NumberofSales< - c(1954720,5129937,108104104,970326,
1608012,746613,333424,236643,
352294,587541,209218,342455,
192670 ,340580,275260,248049,
251790,128845,303515,112218,
149878,226633,194973,103425,
101819,114570)
Price <-C(11.325489, 3.629273,11.411762,10.857959
5.535661 9.804504 11.461083 11.880766
7.048374 3.652103 10.220966,5.550183
10.123631,5.36471,6.312868,6267350
5.458517,103.53681, 4.001779,10616924,
7.936855,5.182608,5.963908,10.938264,
11.069152,9.675570)
Opps< - c(5144351,6038044,23454341,4578272,
7197544,474510,1045528 ,181471,
1071631,801038,928563,477870,
590497,849537,410179,432703,
198 3993,330478,939806,191824,
283107,575004,256846,249530,
142318,2036363)
df< -data.frame(parent,child,Type,Size,
收入,销售数量,价格,运营商)

这是它的样子:

  df 

父子类型大小收入NumberofSales价格Opps
1 a A桌面MEDIUM 22138.16 1954720 11.325489 5144351
2 b B桌面MEDIUM 18617.94 5129937 3.629273 6038044
3 c C桌面大号12394.36 1086104 11.411762 2354341
4 d D桌面大号10535.76 970326 10.857959 4578272
5 e E桌面小8901.41 1608012 5.535661 7197544
6 f F桌面MEDIUM 7320.17 746613 9.804504 474510
7 g G桌面大3821.40 333424 11.461083 1045528
8小时H桌面小2811.50 236643 11.880766 181471
9我桌面媒体2483.10 352294 7.048374 1071631
10 j J桌面SMALL 2145.76 587541 3.652103 801038
11 k K桌面LARGE 2138.41 209218 10.220966 928563
12 l L桌面LARGE 2037.67 342455 5.950183 477870
13 m M桌面SMALL 1950.52 192670 10.123631 590497
14 n N桌面SMALL 1837.93 340580 5.396471 849537
15 o O桌面LARGE 1737.68 275260 6.312868 410179
16 p P桌面LARGE 1554.61 248049 6.267350 432703
17 q Q桌面媒体1374.40 251790 5.458517 1983993
18 r R桌面SMALL 1334.02 128845 10.353681 330478
19 s S桌面SMALL 1214.60 303515 4.001779 939806
20 t T桌面媒体1191.41 112218 10.616924 191824
21 u U desktop LARGE 1189.56 149878 7.936855 283107
22 v V桌面媒体1174.55 226633 5.182608 575004
23 w W桌面SMALL 1162.80 194973 5.963908 256846
24 x X桌面媒体1131.29 103425 10.938264 249530
25 y Y桌面LARGE 1127.05 101819 11.069152 142318
26 z Z桌面媒体1108.53 114570 9.675570 2036363

我想创建一个数据框显示的价格 BY 大小类型包含这些价格范围的所有适当指标。我想要最终的数据框看起来像这样。 (我没有做这个度量值的聚合,因为我目前正在做的太长了,这就是为什么他们现在都是一样的,但最终的答案应该有所有不同的值)

 类型尺寸价格范围SUM_Opps SUM_NumberofSales SUM_Revenue 
1桌面LARGE $ 3 $ 3.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 4- $ 4.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 5- $ 5.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 6- $ 6.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 7- $ 7.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 8- $ 8.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 9 $ 9.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 10- $ 10.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 11- $ 11.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 12- $ 12.99 9,143,587 2,531,983 $ 8,453.93
1桌面LARGE $ 13-大9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 3- $ 3.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 4- $ 4.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 5- $ 5.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 6- $ 6.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 7 - $ 7.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 8- $ 8.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 9- $ 9.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 10- $ 10.99 9,143,587 2,531,983 $ 8,453.93
1桌面中档$ 11- $ 11.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 12- $ 12.99 9,143,587 2,531,983 $ 8,453.93
1桌面媒体$ 13 - 大9,143,587 2,531,983 $ 8,453.93
1桌面小$ 3 $ 3.99 9,143,587 2,531,983 $ 8,453.93
1桌面SMALL $ 4 - $ 4.99 9,143,587 2,531,983 $ 8,453.93
1桌面SMALL $ 5- $ 5.99 9,143,587 2,531,983 $ 8,453.93
1桌面SMALL $ 6- $ 6.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 7 $ 7.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 8- $ 8.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 9 $ 9.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 10 $ 10.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 11- $ 11.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 12- $ 12.99 9,143,587 2,531,983 $ 8,453.93
1桌面小$ 13-大9,143,587 2,531,983 $ 8,453.93

如何创建上表?上表显示了 OPPS 销售数量收入 BY 类型大小价格范围



我了解了如何使用dplyr进行简单的聚合,但艰难的部分是分配价格。



任何帮助都会很棒,谢谢!

解决方案

使用 Hmisc :: cut2()生成一个因子水平的价格仓:

 code> library(Hmisc)
library(dplyr)

df $ cut_Price< - cut2(df $ Price,cutting = 4:13)

df%>%group_by(cut_Price,Size,Type)%>%
summarise_at(c(Opps,NumberofSales,Revenue),sum)%>%
布置(Size,cut_Price)%>%ungroup()%>%
mutate(cut_Price = gsub((。*,\\\\。))00,\ \199,cut_Price))

#一个字符串:16×6
cut_Price大小类型Opps Number ofSales收入
< chr> < FCTR> < FCTR> < DBL> < DBL> < DBL>
1 [5.00,6.99] LARGE桌面477870 342455 2037.67
2 [6.00,7.99] LARGE桌面842882 523309 3292.29
3 [7.00,8.99] LARGE桌面283107 149878 1189.56
4 [10.00,11.00] LARGE桌面5506835 1179544 12674.17
5 [11.00,12.00] LARGE桌面3542187 1521347 17342.81
6 [3.63,4.99]媒体桌面6038044 5129937 18617.94
7 [5.00,6.99]媒体桌面2558997 478423 2548.95
8 [7.00,8.99)媒体桌面1071631 352294 2483.10
9 [9.00,10.00]媒体桌面2510873 861183 8428.70
10 [10.00,11.00)媒体桌面441354 215643 2322.70
11 [11.00,12.00] MEDIUM桌面5144351 1954720 22138.16
12 [3.63,4.99)小桌面801038 587541 2145.76
13 [4.00,5.99]小桌面939806 303515 1214.60
14 [5.00,6.99]小桌面8303927 2143565 11902.14
15 [10.00,11.00)小桌面920975 321515 3284.54
16 [11.00,12.00]小桌面181471 236643 2811.50

如果你想调整削减每0.5而不是1,你可以这样做,因为它的向量传递给 cut = ... 正在定义切点:

  df $ cut_Price<  -  cut2(df $ Price,cutting = seq(4,13,.5))


I have a dataframe as follows:

parent<- c('a', 'b', 'c', 'd', 
         'e', 'f', 'g', 'h', 
         'i', 'j', 'k', 'l',
         'm', 'n', 'o', 'p',
         'q', 'r', 's', 't', 
         'u', 'v', 'w', 'x',
         'y', 'z')
child<- c('A', 'B', 'C', 'D', 
         'E', 'F', 'G', 'H', 
         'I', 'J', 'K', 'L',
         'M', 'N', 'O', 'P',
         'Q', 'R', 'S', 'T', 
         'U', 'V', 'W', 'X',
         'Y', 'Z')
Type<- c('desktop', 'desktop', 'desktop', 'desktop', 
         'desktop', 'desktop', 'desktop', 'desktop', 
         'desktop', 'desktop', 'desktop', 'desktop',
         'desktop', 'desktop', 'desktop', 'desktop',
         'desktop', 'desktop', 'desktop', 'desktop', 
         'desktop', 'desktop', 'desktop', 'desktop',
         'desktop', 'desktop')
Size<- c('MEDIUM', 'MEDIUM', 'LARGE', 'LARGE', 
         'SMALL', 'MEDIUM', 'LARGE', 'SMALL', 
         'MEDIUM', 'SMALL', 'LARGE', 'LARGE',
         'SMALL', 'SMALL', 'LARGE', 'LARGE',
         'MEDIUM', 'SMALL', 'SMALL', 'MEDIUM', 
         'LARGE', 'MEDIUM', 'SMALL', 'MEDIUM',
         'LARGE', 'MEDIUM')
Revenue<- c(22138.16, 18617.94, 12394.36, 10535.76, 
         8901.41, 7320.17, 3821.40, 2811.50, 
         2483.10, 2145.76, 2138.41, 2037.67,
         1950.52, 1837.93, 1737.68, 1554.61,
         1374.40, 1334.02, 1214.60, 1191.41, 
         1189.56, 1174.55, 1162.80, 1131.29,
         1127.05, 1108.53)
NumberofSales<- c(1954720, 5129937, 1086104, 970326, 
                  1608012, 746613, 333424, 236643, 
                  352294, 587541, 209218, 342455,
                  192670, 340580, 275260, 248049,
                  251790, 128845, 303515, 112218, 
                  149878, 226633, 194973, 103425,
                  101819, 114570)
Price<- c(11.325489, 3.629273, 11.411762, 10.857959, 
          5.535661, 9.804504, 11.461083, 11.880766, 
          7.048374, 3.652103, 10.220966, 5.950183,
          10.123631, 5.396471, 6.312868, 6.267350,
          5.458517, 10.353681, 4.001779, 10.616924, 
          7.936855, 5.182608, 5.963908, 10.938264,
          11.069152, 9.675570)
Opps<- c(5144351, 6038044, 2354341, 4578272, 
         7197544, 474510, 1045528, 181471, 
         1071631, 801038, 928563, 477870,
         590497, 849537, 410179, 432703,
         1983993, 330478, 939806, 191824, 
         283107, 575004, 256846, 249530,
         142318, 2036363)
df<-data.frame(parent, child, Type, Size, 
               Revenue, NumberofSales, Price, Opps)

This is what it looks like:

df

   parent child    Type   Size  Revenue NumberofSales     Price    Opps
1       a     A desktop MEDIUM 22138.16       1954720 11.325489 5144351
2       b     B desktop MEDIUM 18617.94       5129937  3.629273 6038044
3       c     C desktop  LARGE 12394.36       1086104 11.411762 2354341
4       d     D desktop  LARGE 10535.76        970326 10.857959 4578272
5       e     E desktop  SMALL  8901.41       1608012  5.535661 7197544
6       f     F desktop MEDIUM  7320.17        746613  9.804504  474510
7       g     G desktop  LARGE  3821.40        333424 11.461083 1045528
8       h     H desktop  SMALL  2811.50        236643 11.880766  181471
9       i     I desktop MEDIUM  2483.10        352294  7.048374 1071631
10      j     J desktop  SMALL  2145.76        587541  3.652103  801038
11      k     K desktop  LARGE  2138.41        209218 10.220966  928563
12      l     L desktop  LARGE  2037.67        342455  5.950183  477870
13      m     M desktop  SMALL  1950.52        192670 10.123631  590497
14      n     N desktop  SMALL  1837.93        340580  5.396471  849537
15      o     O desktop  LARGE  1737.68        275260  6.312868  410179
16      p     P desktop  LARGE  1554.61        248049  6.267350  432703
17      q     Q desktop MEDIUM  1374.40        251790  5.458517 1983993
18      r     R desktop  SMALL  1334.02        128845 10.353681  330478
19      s     S desktop  SMALL  1214.60        303515  4.001779  939806
20      t     T desktop MEDIUM  1191.41        112218 10.616924  191824
21      u     U desktop  LARGE  1189.56        149878  7.936855  283107
22      v     V desktop MEDIUM  1174.55        226633  5.182608  575004
23      w     W desktop  SMALL  1162.80        194973  5.963908  256846
24      x     X desktop MEDIUM  1131.29        103425 10.938264  249530
25      y     Y desktop  LARGE  1127.05        101819 11.069152  142318
26      z     Z desktop MEDIUM  1108.53        114570  9.675570 2036363

I want to create a dataframe that shows the distribution of Price BY Size and Type with all of the appropriate metrics for these Price ranges. I want the final dataframe to look like this. ( I didn't do the aggregation for the metric values because it takes way too long the way I am currently doing it, that's why they are all the same right now but the final answer should have all different values)

       Type    Size     Price Range    SUM_Opps   SUM_NumberofSales  SUM_Revenue 
1   desktop   LARGE        $3-$3.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE        $4-$4.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE        $5-$5.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE        $6-$6.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE        $7-$7.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE        $8-$8.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE        $9-$9.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE      $10-$10.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE      $11-$11.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE      $12-$12.99   9,143,587           2,531,983    $8,453.93
1   desktop   LARGE     $13-Greater   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $3-$3.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $4-$4.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $5-$5.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $6-$6.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $7-$7.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $8-$8.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM        $9-$9.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM      $10-$10.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM      $11-$11.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM      $12-$12.99   9,143,587           2,531,983    $8,453.93
1   desktop  MEDIUM     $13-Greater   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $3-$3.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $4-$4.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $5-$5.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $6-$6.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $7-$7.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $8-$8.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL        $9-$9.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL      $10-$10.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL      $11-$11.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL      $12-$12.99   9,143,587           2,531,983    $8,453.93
1   desktop   SMALL     $13-Greater   9,143,587           2,531,983    $8,453.93

How do I create the table above? The table above is showing the sum of OPPS, Number of Sales, and Revenue BY Type, Size, and Price Range.

I understand how to use dplyr to do the simple aggregation but the tough part is doing the distribution of prices.

Any help would be great, thanks!

解决方案

you could use Hmisc::cut2() to generate you price bins as levels of a factor:

library(Hmisc)
library(dplyr)

df$cut_Price <- cut2(df$Price, cuts = 4:13)

df %>% group_by(cut_Price, Size, Type) %>%
    summarise_at(c("Opps", "NumberofSales", "Revenue"),"sum") %>%
    arrange(Size, cut_Price) %>% ungroup() %>%
    mutate(cut_Price = gsub("(.*, \\d\\.)00", "\\199", cut_Price))

 # A tibble: 16 × 6
       cut_Price   Size    Type    Opps NumberofSales  Revenue
           <chr> <fctr>  <fctr>   <dbl>         <dbl>    <dbl>
1  [ 5.00, 6.99)  LARGE desktop  477870        342455  2037.67
2  [ 6.00, 7.99)  LARGE desktop  842882        523309  3292.29
3  [ 7.00, 8.99)  LARGE desktop  283107        149878  1189.56
4  [10.00,11.00)  LARGE desktop 5506835       1179544 12674.17
5  [11.00,12.00)  LARGE desktop 3542187       1521347 17342.81
6  [ 3.63, 4.99) MEDIUM desktop 6038044       5129937 18617.94
7  [ 5.00, 6.99) MEDIUM desktop 2558997        478423  2548.95
8  [ 7.00, 8.99) MEDIUM desktop 1071631        352294  2483.10
9  [ 9.00,10.00) MEDIUM desktop 2510873        861183  8428.70
10 [10.00,11.00) MEDIUM desktop  441354        215643  2322.70
11 [11.00,12.00) MEDIUM desktop 5144351       1954720 22138.16
12 [ 3.63, 4.99)  SMALL desktop  801038        587541  2145.76
13 [ 4.00, 5.99)  SMALL desktop  939806        303515  1214.60
14 [ 5.00, 6.99)  SMALL desktop 8303927       2143565 11902.14
15 [10.00,11.00)  SMALL desktop  920975        321515  3284.54
16 [11.00,12.00)  SMALL desktop  181471        236643  2811.50

if you want to adjust the cuts to every 0.5 instead of 1, you could do this since its the vector passed to cut = ... is defining the "cut points":

df$cut_Price <- cut2(df$Price, cuts = seq(4,13,.5))

这篇关于按组进行数据帧聚合 - 按照范围分隔列的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆