Iteratively create columns based on grouped variables
I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june = sample(1:50, 50, replace = TRUE),
  july = sample(100:150, 50, replace = TRUE)
)
So, what I want to do is 1) group this data by "region", then 2) add a new column for each of the following months that holds the per-region sum of that month's values (in the real data frame, many more periods follow).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
  name <- paste(col, "_tot", sep="")
  data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
  return(data2)
}
but calling this with lapply doesn't work, because I have to specify the columns to pass into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problem using dplyr (first, since that is what you tried), followed by data.table and base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
This assumes every column but the first two is monthly data. An explanation by line:
1. We use as.name and lapply to generate a list of the column names we want to mutate, as symbols.
2. We give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. We use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
       biz   region june july june_tot july_tot
1 shipping  mideast   17  124      780     3339
2    telco americas   11  101      465     2901
3    telco  mideast   27  131      780     3339
4     tech americas   24  135      465     2901
... rows omitted
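In more recent dplyr releases the same pattern can be written with across(). The following is a minimal sketch, assuming dplyr 1.0.0 or later. The small data frame `toy` is made up for illustration and is not the data from the question, and selecting the month columns with where(is.numeric) assumes that every numeric column is a month.

```r
library(dplyr)

# Hypothetical toy data with the same column layout as the question's `data`
toy <- data.frame(
  biz    = c("telco", "tech", "telco", "shipping"),
  region = c("mideast", "americas", "mideast", "americas"),
  june   = c(1, 2, 3, 4),
  july   = c(10, 20, 30, 40)
)

result <- toy %>%
  group_by(region) %>%
  # apply sum() to every numeric column and name each result "<column>_tot"
  mutate(across(where(is.numeric), sum, .names = "{.col}_tot")) %>%
  ungroup()

print(result)
```

Because region is the grouping variable, across() leaves it out of the selection, so only the month columns are summed here.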
data.table:
library(data.table)
new.names <- paste0(tail(names(data), 2L), "_tot")  # make new names
data.table(data)[,
  (new.names) := lapply(.SD, sum),  # `lapply` `sum` over the selected columns (those in .SD), and assign to the `new.names` columns
  by = region, .SDcols = -1         # group by `region`, and exclude the first column from `.SD` (note `region` is excluded as well, by reason of being in `by`)
][]                                 # extra `[]` just to force printing
Here the logic is similar, except we use the special .SD object, which represents every column in the data.table that we are not grouping by.
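When the month columns are not conveniently the last ones, .SDcols can also take column names. The sketch below is one possible variant, not part of the original answer: the table `dt` and the helper names `month_cols` and `new_names` are hypothetical, and it assumes every column other than biz and region is a month.

```r
library(data.table)

# Hypothetical toy table with the same column layout as the question's data
dt <- data.table(
  biz    = c("telco", "tech", "telco", "shipping"),
  region = c("mideast", "americas", "mideast", "americas"),
  june   = c(1, 2, 3, 4),
  july   = c(10, 20, 30, 40)
)

# Assumption: everything except the identifier columns is a month
month_cols <- setdiff(names(dt), c("biz", "region"))
new_names  <- paste0(month_cols, "_tot")

# Per-region sums of each month column, added by reference as new columns
dt[, (new_names) := lapply(.SD, sum), by = region, .SDcols = month_cols]

print(dt)
```

Selecting by name avoids relying on column positions, at the cost of having to list the non-month columns explicitly.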
base:
do.call(
  cbind,
  list(
    data,
    setNames(
      lapply(data[-(1:2)], function(x) ave(x, data$region, FUN = sum)),
      paste0(names(data[-(1:2)]), "_tot")
    )
  )
)
Here we use ave to compute the per-region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
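To see what ave contributes on its own, here is a small base R sketch on made-up vectors (the values are illustrative, not taken from the question's data). ave returns a vector of the same length as its input, with each element replaced by the result of FUN over that element's group.

```r
# Hypothetical values: four rows in two regions
june   <- c(1, 2, 3, 4)
region <- c("mideast", "americas", "mideast", "americas")

# Each position receives the sum of `june` over the rows sharing its region
june_tot <- ave(june, region, FUN = sum)

print(june_tot)  # 4 6 4 6
```

Since the result lines up row by row with the input, it can be bound straight back onto the data frame, which is what the cbind step above relies on.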