Aggregate using data.table
Problem description
I'm looking for a simpler way to aggregate and calculate percentages of a numerical variable using data.table. The following code outputs the desired result; my question is whether there is a better way to get the same result. I'm not really familiar with the package, so any tips would be useful.

I'd like to have the following columns:
   second_factor_variable third_factor_variable factor_variable       porc porcentaje
1:                   HIGH                     C           > 200 0.04456544        4 %
2:                    LOW                     A        51 - 100 0.31739130       32 %
3:                    LOW                     A       101 - 200 0.68260870       68 %
4:                    LOW                     A         26 - 50 0.00000000        0 %
Where porc is the numerical percentage and porcentaje is the percentage rounded to be used as a label in a ggplot call.
library("ggplot2")
library("scales")
library("data.table")

### Generate some data
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)
df$factor_variable <- cut(df$x, right = TRUE,
                          breaks = c(0, 25, 50, 100, 200, 100000),
                          labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")
)
df$second_factor_variable <- cut(df$x, right = TRUE,
                                 breaks = c(0, 100, 100000),
                                 labels = c("LOW", "HIGH")
)
df$third_factor_variable <- cut(df$x, right = TRUE,
                                breaks = c(0, 50, 100, 100000),
                                labels = c("A", "B", "C")
)
str(df)

### Aggregate
DT <- data.table(df)
dt = DT[, list(factor_variable = unique(DT$factor_variable),
               porc = as.numeric(table(factor_variable)/length(factor_variable)),
               porcentaje = paste(round(as.numeric(table(factor_variable)/length(factor_variable), 0) * 100), "%")
          ),
        by = "second_factor_variable,third_factor_variable"]
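One possible simplification, offered as a sketch rather than a definitive answer: count rows per combination with `.N`, then divide by the total of each outer group. The toy `DT` below is made up for illustration and only stands in for the real data; `res` and `N` are names introduced here. Note one behavioural difference: unlike `table()`, this approach drops factor levels that never occur in a group, so there are no 0 % rows.

```r
library(data.table)

# Toy data standing in for the question's DT (values are invented).
DT <- data.table(
  second_factor_variable = c("LOW", "LOW", "LOW", "LOW", "HIGH", "HIGH"),
  third_factor_variable  = c("A", "A", "A", "B", "C", "C"),
  factor_variable        = c("0 - 25", "26 - 50", "26 - 50",
                             "51 - 100", "101 - 200", "> 200")
)

# Step 1: count rows for every observed combination of the three variables.
res <- DT[, .N, by = .(second_factor_variable, third_factor_variable, factor_variable)]

# Step 2: proportion of each count within its (second, third) group.
res[, porc := N / sum(N), by = .(second_factor_variable, third_factor_variable)]

# Step 3: label; scale to percent first, then round.
res[, porcentaje := paste(round(porc * 100), "%")]

print(res)
```

Because the label and the count come from the same row, they cannot drift out of order. In the original code, `unique(DT$factor_variable)` lists values in order of first appearance while `table()` counts in level order, so the two may not line up.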
EDIT
I've tried agstudy's solution, grouping by just one variable, and I believe it didn't work for producing the labels (the porcentaje column). In the real dataset I ended up having a similar issue, and I can't spot what's wrong with this function.
grp <- function(factor_variable) {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = factor_variable[1],
       porc = porc,
       porcentaje = paste(round(porc, 0) * 100, "%"))
}
DT[, grp(factor_variable), by = "second_factor_variable"]
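A likely reason the labels come out wrong: `round(porc, 0) * 100` rounds the proportion to 0 or 1 before scaling, so every label can only be 0 % or 100 %. A minimal sketch using the proportions from the table in the question (the vector `porc` is typed in by hand, and `bad` and `good` are names made up for this illustration):

```r
# Proportions copied by hand from the desired output in the question.
porc <- c(0.04456544, 0.31739130, 0.68260870, 0)

# Rounds first, then scales: the proportion collapses to 0 or 1.
bad <- paste(round(porc, 0) * 100, "%")

# Scales first, then rounds: gives the intended whole-number percentage.
good <- paste(round(porc * 100), "%")

print(bad)
print(good)
```

Moving the `* 100` inside `round()` should be enough to get labels such as 4 %, 32 % and 68 %.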
The numerical values are correct
DT2 <- DT[DT$second_factor_variable %in% "LOW"]
table(DT2$factor_variable)/length(DT2$factor_variable)
I believe the same problem appears if I group by two factor variables:
DT[, grp(factor_variable) , by="second_factor_variable,third_factor_variable"]
Solution 2 changes: factor out the porc variable and don't use DT to compute factor_variable
DT[, {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = factor_variable[1],
       porc = porc,
       porcentaje = paste(round(porc, 0) * 100, "%"))
}, by = "second_factor_variable,third_factor_variable"]
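As far as I can tell, two details in this version still get in the way: `factor_variable[1]` returns only the first value of each group (which is then recycled onto every row), and `round(porc, 0) * 100` rounds before scaling. A hedged sketch of one way around both, using `names()` of the `table()` result so each label stays aligned with its count. The toy `DT`, `lv`, `tab` and `res` are assumptions made up for this example, and only one grouping variable is used to keep it short:

```r
library(data.table)

# Toy data; factor levels are fixed so empty levels still show up in table().
lv <- c("0 - 25", "26 - 50", "51 - 100")
DT <- data.table(
  second_factor_variable = c("LOW", "LOW", "LOW", "HIGH"),
  factor_variable = factor(c("0 - 25", "26 - 50", "26 - 50", "51 - 100"),
                           levels = lv)
)

res <- DT[, {
  tab  <- table(factor_variable)      # counts per level, in level order
  porc <- as.numeric(tab) / .N        # proportion within the current group
  list(factor_variable = names(tab),  # labels in the same order as the counts
       porc = porc,
       porcentaje = paste(round(porc * 100), "%"))
}, by = second_factor_variable]

print(res)
```

This keeps one row per level in every group, including levels with a 0 % share, which may or may not be wanted for the plot labels.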