Iteratively create columns based on grouped variables


Problem description

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.

set.seed(1234)
data <- data.frame(
    biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
    region = sample(c("mideast","americas"), 50, replace = TRUE),
    june = sample(1:50, 50, replace=TRUE),
    july = sample(100:150, 50, replace=TRUE)
    )

So, what I want to do is 1) group this data by "region", then 2) add a new column for each of the following months that holds the per-region sum of that month's values (in the real data frame, many more periods follow).

Basically, I want to apply this function

library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))

across every month, without having to specify "june" or "july". My initial take:

testfun <- function(df, col) {
    name <- paste(col, "_tot", sep="")
    data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
    return(data2)
}

but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.

Any ideas how to lapply this sort of argument?

Solution

Here are possible solutions using dplyr (first, since that is what you tried), followed by data.table and base R solutions:

dplyr:

cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)

This assumes that every column except the first two holds monthly data. A line-by-line explanation:

  1. we use as.name and lapply to generate a list of the column names we want to mutate, as symbols
  2. we give the new names we want (i.e. month_tot) to the list of symbols from 1.
  3. we use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.

This is the (sample) result:

Source: local data frame [50 x 6]
Groups: region

        biz   region june july june_tot july_tot
1  shipping  mideast   17  124      780     3339
2     telco americas   11  101      465     2901
3     telco  mideast   27  131      780     3339
4      tech americas   24  135      465     2901
... rows omitted
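
The mutate_each_q / mutate_each_ functions used above belong to old dplyr versions. As a minimal sketch of the same idea for newer releases, assuming dplyr 1.0.0 or later, across() can apply sum to every numeric column and build the "_tot" names through its .names argument. The small data frame here is a hand-made stand-in with made-up values, not the sampled data from the question:

```r
library(dplyr)

# Hypothetical stand-in for the question's data (values are illustrative only).
data <- data.frame(
  biz    = c("telco", "shipping", "tech", "telco"),
  region = c("mideast", "americas", "mideast", "americas"),
  june   = c(10, 20, 30, 40),
  july   = c(100, 110, 120, 130)
)

result <- data %>%
  group_by(region) %>%
  # Sum each numeric column within its region; "{.col}_tot" names the new columns.
  mutate(across(where(is.numeric), sum, .names = "{.col}_tot")) %>%
  ungroup()

result
```

Selecting with where(is.numeric) avoids relying on column positions, on the assumption that the month columns are the only numeric ones.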

data.table:

library(data.table)

new.names <- paste0(tail(names(data), 2L), "_tot")  # Make new names
data.table(data)[,
  (new.names) := lapply(.SD, sum),  # `lapply` `sum` over the selected columns (those in .SD), and assign to the `new.names` columns
  by = region, .SDcols = -1         # group by `region`, and exclude the first column from `.SD` (`region` is excluded as well because it is in `by`)
][]                                 # extra `[]` just to force printing

The logic here is similar, except that we use the special .SD object, which represents every column in the data.table that we are not grouping by.
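
If the month columns are not guaranteed to be the last two, a variation is to pick them by name rather than by position. The following is only a sketch under that assumption; month.cols is a hypothetical helper variable, and the data frame is a hand-made stand-in with made-up values:

```r
library(data.table)

# Hypothetical stand-in for the question's data (values are illustrative only).
data <- data.frame(
  biz    = c("telco", "shipping", "tech", "telco"),
  region = c("mideast", "americas", "mideast", "americas"),
  june   = c(10, 20, 30, 40),
  july   = c(100, 110, 120, 130)
)

dt <- as.data.table(data)

# Every column that is not an identifier is treated as a month column.
month.cols <- setdiff(names(dt), c("biz", "region"))
new.names  <- paste0(month.cols, "_tot")

# Add the per-region totals by reference; the original row order is kept.
dt[, (new.names) := lapply(.SD, sum), by = region, .SDcols = month.cols]

dt
```

Because `:=` modifies dt by reference, no reassignment is needed after the call.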

base:

do.call(
  cbind, 
  list(
    data, 
    setNames(
      lapply(data[-(1:2)], function(x) ave(x, data$region, FUN=sum)),
      paste0(names(data[-(1:2)]), "_tot")
) ) )

Here we use ave to compute the per-region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
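
To make the role of ave more concrete, here is a small sketch on hand-made values (not the question's sampled data). ave returns a vector of the same length as its input, with each element replaced by the result of FUN for its group, which is why the totals line up with the original rows:

```r
# Hypothetical stand-in for the question's data (values are illustrative only).
data <- data.frame(
  biz    = c("telco", "shipping", "tech", "telco"),
  region = c("mideast", "americas", "mideast", "americas"),
  june   = c(10, 20, 30, 40),
  july   = c(100, 110, 120, 130)
)

# One column: each row receives the sum for its own region.
june_tot <- ave(data$june, data$region, FUN = sum)
june_tot
# 40 60 40 60

# All month columns: apply ave to each, rename, and bind back onto the data.
totals <- lapply(data[-(1:2)], function(x) ave(x, data$region, FUN = sum))
names(totals) <- paste0(names(totals), "_tot")
result <- cbind(data, as.data.frame(totals))

result
```

Note that FUN has to be passed by name, since ave treats unnamed arguments after the first as grouping variables.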
