Aggregate using data.table

Views: 145 Posted: 2017/3/12 12:28:33 r data.table


Problem description

I'm looking for a simpler way to aggregate a numerical variable and calculate percentages using data.table.
The following code outputs the desired result; my question is whether there is a better way to get the same result. I'm not very familiar with the package, so any tips would be useful.

I'd like to have the following columns:
   second_factor_variable third_factor_variable factor_variable       porc porcentaje
1:                   HIGH                     C           > 200 0.04456544        4 %
2:                    LOW                     A        51 - 100 0.31739130       32 %
3:                    LOW                     A       101 - 200 0.68260870       68 %
4:                    LOW                     A         26 - 50 0.00000000        0 %
Here porc is the percentage as a number (a proportion between 0 and 1) and porcentaje is the same value rounded to a whole percentage, to be used as a label in a ggplot call.
library("ggplot2")
library("scales")
library("data.table")

### Generate some data
set.seed(123)
df <- data.frame(x = rnorm(10000, mean = 100, sd = 50))
df <- subset(df, x > 0)

df$factor_variable <- cut(df$x, right = TRUE, 
                          breaks = c(0, 25, 50, 100, 200, 100000),
                          labels = c("0 - 25", "26 - 50", "51 - 100", "101 - 200", "> 200")
                          )

df$second_factor_variable <- cut(df$x, right = TRUE, 
                                 breaks = c(0, 100, 100000),
                                 labels = c("LOW", "HIGH")
                                 )

df$third_factor_variable <- cut(df$x, right = TRUE, 
                                 breaks = c(0, 50, 100, 100000),
                                 labels = c("A", "B","C")
                                )

str(df)

### Aggregate
DT <- data.table(df)
dt = DT[, list(factor_variable = unique(DT$factor_variable),
              porc = as.numeric(table(factor_variable)/length(factor_variable)),
              porcentaje = paste( round( as.numeric(table(factor_variable)/length(factor_variable), 0 ) * 100 ), "%")
              ), by="second_factor_variable,third_factor_variable"]


EDIT

I've tried agstudy's solution, grouping by just one variable, and I believe it doesn't produce the labels correctly (the porcentaje column). In the real dataset I ended up with a similar issue, and I can't spot what's wrong with this function.
grp <- function(factor_variable) {
  porc = as.numeric(table(factor_variable)/length(factor_variable))
  list(factor_variable = factor_variable[1],
       porc =porc,
       porcentaje = paste( round( porc, 0 ) * 100 , "%"))
}

DT[, grp(factor_variable) , by="second_factor_variable"]
The numerical values are correct:
DT2 <- DT[DT$second_factor_variable %in% "LOW"]
table(DT2$factor_variable)/length(DT2$factor_variable)
I believe the same problem appears if I group by two factor variables:
DT[, grp(factor_variable) , by="second_factor_variable,third_factor_variable"]
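
One likely cause of the label problem can be shown in isolation with base R. `round(porc, 0) * 100` rounds each proportion to 0 or 1 before multiplying, so the labels can only ever read 0 % or 100 %. The values below are illustrative, copied from the desired output table above; scaling before rounding is a suggested fix, not something stated in the original post.

```r
# Illustrative proportions, taken from the desired output shown earlier
porc <- c(0.04456544, 0.31739130, 0.68260870, 0)

# As written in grp(): round to 0 or 1 first, then multiply by 100
as_written <- paste(round(porc, 0) * 100, "%")

# Suggested order: scale to a percentage first, then round
scaled_first <- paste(round(porc * 100, 0), "%")

print(as_written)    # "0 %" "0 %" "100 %" "0 %"
print(scaled_first)  # "4 %" "32 %" "68 %" "0 %"
```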

Solution
Two changes: compute porc once and reuse it, and don't refer to DT when computing factor_variable.
DT[, {   porc = as.numeric(table(factor_variable)/length(factor_variable))
         list(factor_variable = factor_variable[1],
               porc =porc,
               porcentaje = paste( round( porc, 0 ) * 100 , "%"))
        }
, by="second_factor_variable,third_factor_variable"]
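
The block above still rounds before multiplying by 100, and `factor_variable[1]` is recycled across every row of the group, which may be related to the label problem raised in the question's edit. What follows is a minimal sketch of one possible variant, not the answerer's code. It assumes the level labels should come from the same `table()` call that produces the counts so that labels and proportions line up, and it uses a small hypothetical dataset with a single grouping column rather than the question's data.

```r
library(data.table)

# Small illustrative dataset (hypothetical, smaller than the question's)
set.seed(123)
x <- rnorm(1000, mean = 100, sd = 50)
x <- x[x > 0]
DT <- data.table(
  factor_variable = cut(x, breaks = c(0, 25, 50, 100, 200, 100000),
                        labels = c("0 - 25", "26 - 50", "51 - 100",
                                   "101 - 200", "> 200")),
  second_factor_variable = cut(x, breaks = c(0, 100, 100000),
                               labels = c("LOW", "HIGH"))
)

res <- DT[, {
  counts <- table(factor_variable)       # one count per level, in level order
  porc   <- as.numeric(counts) / .N      # share of this group's rows per level
  list(factor_variable = names(counts),  # labels come from the same table
       porc = porc,
       porcentaje = paste(round(porc * 100), "%"))  # scale, then round
}, by = "second_factor_variable"]

print(res)
```

Grouping by two columns works the same way with `by = "second_factor_variable,third_factor_variable"`. Because `table()` on a factor reports every level, each group returns one row per level, including levels with a proportion of 0.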


                        