在R中对分组/箱/桶数据进行分组并获取每个桶的计数以及每个桶的值总和 [英] Group/bin/bucket data in R and get count per bucket and sum of values per bucket

查看:233
本文介绍了在R中对分组/箱/桶数据进行分组并获取每个桶的计数以及每个桶的值总和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我希望存储/分组/合并数据:

I wish to bucket/group/bin data :

C1             C2       C3
49488.01172    0.0512   54000
268221.1563    0.0128   34399
34775.96094    0.0128   54444
13046.98047    0.07241  61000
2121699.75     0.00453  78921
71155.09375    0.0181   13794
1369809.875    0.00453  12312
750            0.2048   43451
44943.82813    0.0362   49871
85585.04688    0.0362   18947
31090.10938    0.0362   13401
68550.40625    0.0181   14345

我想按C2值对它进行存储,但是我想定义存储区,例如< = 0.005,< =。010,< =。014等。如您所见,存储周期是不均匀的间隔。我想要每个存储桶的C1计数以及每个存储桶的C1总数。

I want to bucket it by C2 values but I wish to define the buckets e.g. <=0.005, <=.010, <=.014 etc. As you can see, the bucketing will be uneven intervals. I want the count of C1 per bucket as well as the total sum of C1 for every bucket.

我不知道从哪里开始,因为我是一位新用户是否有任何人愿意帮助我找出代码或指导我找到适合我需要的示例?

I don't know where to begin as I am fairly new a user of R. Is there anyone willing to help me figure out the code or direct to me to an example that will work for my needs?

编辑:在C3中添加了另一列。我同时需要每个存储桶C3的总和以及每个存储桶C1的总和和计数

added another column C3. I need sum of C3 per bucket as well at the same time as sum and count of C1 per bucket

推荐答案

C2似乎是字符列,后缀为。在创建组之前,使用 sub 删除,转换为数字( as.numeric )。通过使用函数 cut 与< transform(df,...)创建变量 group code> breaks (组存储区/间隔)和 labels (用于所需的组标签)参数。一旦创建了组变量,就可以使用<$ c $来完成 C1(按 group)的 sum 和 group中元素的 count的操作。 c>从 base R聚合

From the comments, "C2" seems to be "character" column with % as suffix. Before, creating a group, remove the % using sub, convert to "numeric" (as.numeric). The variable "group" is created (transform(df,...)) by using the function cut with breaks (group buckets/intervals) and labels (for the desired group labels) arguments. Once the group variable is created, the sum of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate from "base R"

df1 <-  transform(df, group=cut(as.numeric(sub('[%]', '', C2)), 
    breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
      labels=c('<0.005', 0.005, 0.01, 0.014)))

 res <- do.call(data.frame,aggregate(C1~group, df1, 
        FUN=function(x) c(Count=length(x), Sum=sum(x))))

 dNew <- data.frame(group=levels(df1$group))
 merge(res, dNew, all=TRUE)
 #   group C1.Count    C1.Sum
 #1 <0.005        2 3491509.6
 #2  0.005       NA        NA
 #3   0.01        2  302997.1
 #4  0.014        8  364609.5

或者您可以使用 data.table setDT data.frame 转换为 data.table 。用 by = 指定分组变量,并在列表( .N 给出每个组中元素的数量。

or you can use data.table. setDT converts the data.frame to data.table. Specify the "grouping" variable with by= and summarize/create the two variables "Count" and "Sum" within the list(. .N gives the count of elements within each "group".

 library(data.table)
  setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

或使用 dplyr %>%将LHS与RHS参数连接在一起并将它们链接在一起,使用 group_by 指定 group变量,然后使用 summarise_each summary 来获取有关列的摘要计数和 sum 。 > summarise_each 如果有多于一列会很有用。

Or using dplyr. The %>% connect the LHS with RHS arguments and chains them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get summary count and sum of the concerned column. summarise_each would be useful if there are more than one column.

 library(dplyr)
 df1 %>%
      group_by(group) %>% 
      summarise_each(funs(n(), Sum=sum(.)), C1)



更新



使用新数据集 df

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
                             labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, 
       FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
#  group C1.Count    C1.Sum C3.Count C3.Sum
#1 <0.005        2 3491509.6        2  91233
#2   0.01        2  302997.1        2  88843
#3  0.014        8  364609.5        8 268809

,您可以执行合并,如上所述。

and you can do the merge as detailed above.

除了指定附加变量之外, dplyr 方法将相同

The dplyr approach would be the same except specifying the additional variable

 df1%>%
      group_by(group) %>%
       summarise_each(funs(n(), Sum=sum(.)), C1, C3)
 #Source: local data frame [3 x 5]

 #  group C1_n C3_n    C1_Sum C3_Sum
 #1 <0.005    2    2 3491509.6  91233
 #2   0.01    2    2  302997.1  88843
 #3  0.014    8    8  364609.5 268809



数据



data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", 
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", 
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", 
"C2"), row.names = c(NA, -12L), class = "data.frame")

这篇关于在R中对分组/箱/桶数据进行分组并获取每个桶的计数以及每个桶的值总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆