Aggregate using data.table


Question


I'm looking for a simpler way to aggregate a numerical variable and calculate percentages using data.table. The following code outputs the desired result; my question is whether there is a better way to get the same result. I'm not very familiar with the package, so any tips would be useful.

I'd like to have the following columns:

   second_factor_variable third_factor_variable factor_variable       porc porcentaje
1:                   HIGH                     C           > 200 0.04456544        4 %
2:                    LOW                     A        51 - 100 0.31739130       32 %
3:                    LOW                     A       101 - 200 0.68260870       68 %
4:                    LOW                     A         26 - 50 0.00000000        0 %

Here porc is the numerical proportion, and porcentaje is the same value as a rounded percentage, to be used as a label in a ggplot call.

library("ggplot2")
library("scales")
library("data.table")

### Generate some data
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)

df$factor_variable <- cut(df$x, right = TRUE, 
                          breaks = c(0, 25, 50, 100, 200, 100000),
                          labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")
                          )

df$second_factor_variable <- cut(df$x, right = TRUE, 
                                 breaks = c(0, 100, 100000),
                                 labels = c("LOW", "HIGH")
                                 )

df$third_factor_variable <- cut(df$x, right = TRUE, 
                                 breaks = c(0, 50, 100, 100000),
                                 labels = c("A", "B","C")
                                )

str(df)

### Aggregate
DT <- data.table(df)
dt = DT[, list(factor_variable = unique(DT$factor_variable),
              porc = as.numeric(table(factor_variable)/length(factor_variable)),
              porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
              ), by="second_factor_variable,third_factor_variable"]

EDIT

I've tried agstudy's solution, grouping by just one variable, and I believe it doesn't produce the labels (porcentaje column) correctly. In the real dataset I ran into a similar issue, and I can't spot what's wrong with this function.

grp <- function(factor_variable) {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = factor_variable[1],
       porc =porc,
       porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}

DT[, grp(factor_variable) , by="second_factor_variable"]

The numerical values are correct:

DT2 <- DT[DT$second_factor_variable %in% "LOW"]
table(DT2$factor_variable)/length(DT2$factor_variable)

I believe the same problem appears if I group by two factor variables:

DT[, grp(factor_variable) , by="second_factor_variable,third_factor_variable"]
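
One possible explanation for the label problem described in the EDIT is the order of operations in round(porc, 0) * 100: the proportion is rounded to 0 or 1 first and only then scaled, so every label becomes either "0 %" or "100 %". The sketch below uses only base R and the four porc values from the desired output above; the names labels_wrong and labels_right are made up for illustration.

```r
# Proportions taken from the desired output shown in the question
porc <- c(0.04456544, 0.31739130, 0.68260870, 0.00000000)

# Rounding before scaling: each proportion collapses to 0 or 1 first
labels_wrong <- paste(round(porc, 0) * 100, "%")
print(labels_wrong)   # "0 %" "0 %" "100 %" "0 %"

# Scaling before rounding: the percentage is rounded, not the proportion
labels_right <- paste(round(porc * 100, 0), "%")
print(labels_right)   # "4 %" "32 %" "68 %" "0 %"
```

If this is indeed the cause, moving the multiplication inside round() in grp should be enough to fix the porcentaje column.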

Solution

Two changes: factor out the porc computation into its own variable, and don't refer to DT when computing factor_variable.

DT[, {   porc = as.numeric(table(factor_variable)/length(factor_variable))
         list(factor_variable = factor_variable[1],
               porc =porc,
               porcentaje = paste( round( porc, 0 ) * 100 , "%"))
        }
, by="second_factor_variable,third_factor_variable"]
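
As a hedged variation on the code above, the sketch below shows one way to keep the labels aligned with the proportions and to scale before rounding. It is not the answerer's code: the toy table, its values, and the names toy, counts, and res are hypothetical, chosen so the result can be checked by hand. It assumes that table() on a factor returns one count per level, in level order, so names(counts) lines up with porc.

```r
library(data.table)

# Hypothetical toy data, small enough to verify by hand
toy <- data.table(
  second_factor_variable = c("LOW", "LOW", "LOW", "LOW", "HIGH", "HIGH"),
  factor_variable = factor(c("a", "a", "b", "a", "c", "c"),
                           levels = c("a", "b", "c"))
)

res <- toy[, {
  counts <- table(factor_variable)      # one count per level, in level order
  porc   <- as.numeric(counts) / .N     # proportion of the group's rows per level
  list(factor_variable = names(counts), # labels aligned with porc
       porc = porc,
       porcentaje = paste(round(porc * 100), "%"))  # scale first, then round
}, by = second_factor_variable]

print(res)
```

In this toy case the LOW group should yield proportions 0.75, 0.25 and 0 for levels a, b and c. Grouping by two variables would follow the same pattern, using by = "second_factor_variable,third_factor_variable" as in the code above.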
