Dataframe Aggregation By Group - Separating a Column's Values by Ranges
Problem description
I have a dataframe as follows:
parent<- c('a', 'b', 'c', 'd',
'e', 'f', 'g', 'h',
'i', 'j', 'k', 'l',
'm', 'n', 'o', 'p',
'q', 'r', 's', 't',
'u', 'v', 'w', 'x',
'y', 'z')
child<- c('A', 'B', 'C', 'D',
'E', 'F', 'G', 'H',
'I', 'J', 'K', 'L',
'M', 'N', 'O', 'P',
'Q', 'R', 'S', 'T',
'U', 'V', 'W', 'X',
'Y', 'Z')
Type<- c('desktop', 'desktop', 'desktop', 'desktop',
'desktop', 'desktop', 'desktop', 'desktop',
'desktop', 'desktop', 'desktop', 'desktop',
'desktop', 'desktop', 'desktop', 'desktop',
'desktop', 'desktop', 'desktop', 'desktop',
'desktop', 'desktop', 'desktop', 'desktop',
'desktop', 'desktop')
Size<- c('MEDIUM', 'MEDIUM', 'LARGE', 'LARGE',
'SMALL', 'MEDIUM', 'LARGE', 'SMALL',
'MEDIUM', 'SMALL', 'LARGE', 'LARGE',
'SMALL', 'SMALL', 'LARGE', 'LARGE',
'MEDIUM', 'SMALL', 'SMALL', 'MEDIUM',
'LARGE', 'MEDIUM', 'SMALL', 'MEDIUM',
'LARGE', 'MEDIUM')
Revenue<- c(22138.16, 18617.94, 12394.36, 10535.76,
8901.41, 7320.17, 3821.40, 2811.50,
2483.10, 2145.76, 2138.41, 2037.67,
1950.52, 1837.93, 1737.68, 1554.61,
1374.40, 1334.02, 1214.60, 1191.41,
1189.56, 1174.55, 1162.80, 1131.29,
1127.05, 1108.53)
NumberofSales<- c(1954720, 5129937, 1086104, 970326,
1608012, 746613, 333424, 236643,
352294, 587541, 209218, 342455,
192670, 340580, 275260, 248049,
251790, 128845, 303515, 112218,
149878, 226633, 194973, 103425,
101819, 114570)
Price<- c(11.325489, 3.629273, 11.411762, 10.857959,
5.535661, 9.804504, 11.461083, 11.880766,
7.048374, 3.652103, 10.220966, 5.950183,
10.123631, 5.396471, 6.312868, 6.267350,
5.458517, 10.353681, 4.001779, 10.616924,
7.936855, 5.182608, 5.963908, 10.938264,
11.069152, 9.675570)
Opps<- c(5144351, 6038044, 2354341, 4578272,
7197544, 474510, 1045528, 181471,
1071631, 801038, 928563, 477870,
590497, 849537, 410179, 432703,
1983993, 330478, 939806, 191824,
283107, 575004, 256846, 249530,
142318, 2036363)
df<-data.frame(parent, child, Type, Size,
Revenue, NumberofSales, Price, Opps)
This is what it looks like:
df
parent child Type Size Revenue NumberofSales Price Opps
1 a A desktop MEDIUM 22138.16 1954720 11.325489 5144351
2 b B desktop MEDIUM 18617.94 5129937 3.629273 6038044
3 c C desktop LARGE 12394.36 1086104 11.411762 2354341
4 d D desktop LARGE 10535.76 970326 10.857959 4578272
5 e E desktop SMALL 8901.41 1608012 5.535661 7197544
6 f F desktop MEDIUM 7320.17 746613 9.804504 474510
7 g G desktop LARGE 3821.40 333424 11.461083 1045528
8 h H desktop SMALL 2811.50 236643 11.880766 181471
9 i I desktop MEDIUM 2483.10 352294 7.048374 1071631
10 j J desktop SMALL 2145.76 587541 3.652103 801038
11 k K desktop LARGE 2138.41 209218 10.220966 928563
12 l L desktop LARGE 2037.67 342455 5.950183 477870
13 m M desktop SMALL 1950.52 192670 10.123631 590497
14 n N desktop SMALL 1837.93 340580 5.396471 849537
15 o O desktop LARGE 1737.68 275260 6.312868 410179
16 p P desktop LARGE 1554.61 248049 6.267350 432703
17 q Q desktop MEDIUM 1374.40 251790 5.458517 1983993
18 r R desktop SMALL 1334.02 128845 10.353681 330478
19 s S desktop SMALL 1214.60 303515 4.001779 939806
20 t T desktop MEDIUM 1191.41 112218 10.616924 191824
21 u U desktop LARGE 1189.56 149878 7.936855 283107
22 v V desktop MEDIUM 1174.55 226633 5.182608 575004
23 w W desktop SMALL 1162.80 194973 5.963908 256846
24 x X desktop MEDIUM 1131.29 103425 10.938264 249530
25 y Y desktop LARGE 1127.05 101819 11.069152 142318
26 z Z desktop MEDIUM 1108.53 114570 9.675570 2036363
I want to create a dataframe that shows the distribution of Price by Size and Type, with all of the appropriate metrics for these Price ranges. I want the final dataframe to look like this. (I didn't do the aggregation for the metric values because it takes way too long the way I am currently doing it; that's why they are all the same right now, but the final answer should have different values.)
Type Size Price Range SUM_Opps SUM_NumberofSales SUM_Revenue
1 desktop LARGE $3-$3.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $4-$4.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $5-$5.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $6-$6.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $7-$7.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $8-$8.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $9-$9.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $10-$10.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $11-$11.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $12-$12.99 9,143,587 2,531,983 $8,453.93
1 desktop LARGE $13-Greater 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $3-$3.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $4-$4.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $5-$5.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $6-$6.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $7-$7.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $8-$8.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $9-$9.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $10-$10.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $11-$11.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $12-$12.99 9,143,587 2,531,983 $8,453.93
1 desktop MEDIUM $13-Greater 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $3-$3.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $4-$4.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $5-$5.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $6-$6.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $7-$7.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $8-$8.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $9-$9.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $10-$10.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $11-$11.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $12-$12.99 9,143,587 2,531,983 $8,453.93
1 desktop SMALL $13-Greater 9,143,587 2,531,983 $8,453.93
How do I create the table above? It shows the sum of Opps, NumberofSales, and Revenue by Type, Size, and Price Range.
I understand how to use dplyr to do the simple aggregation but the tough part is doing the distribution of prices.
Any help would be great, thanks!
You could use Hmisc::cut2() to generate your price bins as levels of a factor:
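library(Hmisc)
library(dplyr)
# Bin Price into intervals with cut points at 4, 5, ..., 13
df$cut_Price <- cut2(df$Price, cuts = 4:13)
# Sum the metrics within each price bin / Size / Type group
df %>% group_by(cut_Price, Size, Type) %>%
  summarise_at(c("Opps", "NumberofSales", "Revenue"), "sum") %>%
  arrange(Size, cut_Price) %>% ungroup() %>%
  # Cosmetic relabel: a single-digit upper bound ending in .00 is shown as .99
  mutate(cut_Price = gsub("(.*, \\d\\.)00", "\\199", cut_Price))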
# A tibble: 16 × 6
cut_Price Size Type Opps NumberofSales Revenue
<chr> <fctr> <fctr> <dbl> <dbl> <dbl>
1 [ 5.00, 6.99) LARGE desktop 477870 342455 2037.67
2 [ 6.00, 7.99) LARGE desktop 842882 523309 3292.29
3 [ 7.00, 8.99) LARGE desktop 283107 149878 1189.56
4 [10.00,11.00) LARGE desktop 5506835 1179544 12674.17
5 [11.00,12.00) LARGE desktop 3542187 1521347 17342.81
6 [ 3.63, 4.99) MEDIUM desktop 6038044 5129937 18617.94
7 [ 5.00, 6.99) MEDIUM desktop 2558997 478423 2548.95
8 [ 7.00, 8.99) MEDIUM desktop 1071631 352294 2483.10
9 [ 9.00,10.00) MEDIUM desktop 2510873 861183 8428.70
10 [10.00,11.00) MEDIUM desktop 441354 215643 2322.70
11 [11.00,12.00) MEDIUM desktop 5144351 1954720 22138.16
12 [ 3.63, 4.99) SMALL desktop 801038 587541 2145.76
13 [ 4.00, 5.99) SMALL desktop 939806 303515 1214.60
14 [ 5.00, 6.99) SMALL desktop 8303927 2143565 11902.14
15 [10.00,11.00) SMALL desktop 920975 321515 3284.54
16 [11.00,12.00) SMALL desktop 181471 236643 2811.50
If you want to adjust the cuts to every 0.5 instead of 1, you can do that too, since the vector passed to cuts = ... defines the "cut points":
df$cut_Price <- cut2(df$Price, cuts = seq(4,13,.5))
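If you would rather get exactly the labels from the question ("$3-$3.99" up to "$13-Greater"), here is a rough sketch using base R's cut() and aggregate() instead of Hmisc and dplyr. This is a swapped-in alternative, not the approach above. It uses only six rows copied from the question's data to keep it short, and the names breaks, labels, PriceRange and result are just illustrative. Note that in the output above the relabelled bins such as [ 5.00, 6.99) actually hold prices from 5 up to (but not including) 6, which is what custom labels make explicit.

# Six rows copied from the question's data (rows 2, 3, 7, 10, 12, 19)
df <- data.frame(
  Type          = "desktop",
  Size          = c("MEDIUM", "LARGE", "LARGE", "SMALL", "LARGE", "SMALL"),
  Price         = c(3.629273, 11.411762, 11.461083, 3.652103, 5.950183, 4.001779),
  Opps          = c(6038044, 2354341, 1045528, 801038, 477870, 939806),
  NumberofSales = c(5129937, 1086104, 333424, 587541, 342455, 303515),
  Revenue       = c(18617.94, 12394.36, 3821.40, 2145.76, 2037.67, 1214.60)
)

# Bin edges for [3,4), [4,5), ..., [12,13), [13,Inf)
breaks <- c(3:13, Inf)
# One label per bin, in the format used in the question
labels <- c(sprintf("$%d-$%d.99", 3:12, 3:12), "$13-Greater")

# right = FALSE closes each bin on the left, so a price of exactly 4
# lands in "$4-$4.99" rather than "$3-$3.99"
df$PriceRange <- cut(df$Price, breaks = breaks, labels = labels, right = FALSE)

# Sum the three metrics within each Type / Size / PriceRange combination
result <- aggregate(cbind(Opps, NumberofSales, Revenue) ~ Type + Size + PriceRange,
                    data = df, FUN = sum)
result <- result[order(result$Size, result$PriceRange), ]
print(result)

Only combinations that actually occur in the data show up in result; empty price ranges are dropped, as they are in the cut2() output above. Prices below 3 would fall outside the breaks and become NA, so extend breaks and labels if your data can go lower.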