Performing dplyr mutate on subset of columns
Question
I have a data.frame such as this (the real data set has many more rows and columns)
set.seed(15)
dd <- data.frame(id=letters[1:4], matrix(runif(5*4), nrow=4))
# id X1 X2 X3 X4 X5
# 1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437
# 2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670
# 3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871
# 4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125
I would like to be able to write a dplyr statement where I can select a subset of columns and mutate them. (I'm trying to do something similar to using .SDcols in data.table).
For a simplified example, here's the function I would like to be able to write to add columns for the sums and means of the even "X" columns while preserving all other columns. The desired output using base R is
(cols<-paste0("X", c(2,4)))
# [1] "X2" "X4"
cbind(dd,evensum=rowSums(dd[,cols]),evenmean=rowMeans(dd[,cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
# 1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
# 2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
# 3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
# 4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768
But I wanted to use a dplyr-like chain to do the same thing. In the general case, I'd like to be able to use any of select()'s helper functions, such as starts_with, ends_with, matches, etc., and any function. Here's what I tried:
library(dplyr)
partial_mutate1 <- function(x, colspec, ...) {
select_(x, .dots=list(lazyeval::lazy(colspec))) %>%
transmute_(.dots=lazyeval::lazy_dots(...)) %>%
cbind(x,.)
}
dd %>% partial_mutate1(num_range("X", c(2,4)),
evensum=rowSums(.), evenmean=rowMeans(.))
However, this throws an error that says
Error in rowSums(.) : 'x' must be numeric
This appears to be because . seems to refer to the entire data.frame rather than the selected subset (it is the same error as rowSums(dd)). However, note that this produces the desired output:
partial_mutate2 <- function(x, colspec) {
select_(x, .dots=list(lazyeval::lazy(colspec))) %>%
transmute(evensum=rowSums(.), evenmean=rowMeans(.)) %>%
cbind(x,.)
}
dd %>% partial_mutate2(seq(2,ncol(dd),2))
I'm guessing this is some sort of environment problem? Any suggestions on how to pass the arguments to partial_mutate1 so that the . will correctly take values from the "select()-ed" dataset?
Recommended answer
Am I missing something or would this work as expected:
cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768
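If hard-coding the column names is not desirable, the same idea can be written with a tidyselect helper. The following is a minimal sketch, assuming dplyr 1.1.0 or later (where pick() is available); it is not part of the original answer, and the result name res is only illustrative.

```r
library(dplyr)

set.seed(15)
dd <- data.frame(id = letters[1:4], matrix(runif(5 * 4), nrow = 4))

# pick() returns the selected columns as a data frame inside mutate(),
# so any select() helper (num_range, starts_with, matches, ...) can be used.
res <- dd %>%
  mutate(
    evensum  = rowSums(pick(num_range("X", c(2, 4)))),
    evenmean = rowMeans(pick(num_range("X", c(2, 4))))
  )

res
```

All original columns are kept, and only X2 and X4 feed into the two new columns.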
Or are you specifically looking for a custom function to do this?
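If a custom function is the goal, one possible approach is to capture the expressions unevaluated and then evaluate them with . explicitly bound to the selected subset. This is a rough sketch under stated assumptions: partial_mutate is a hypothetical name (not a dplyr function), the column selection relies on the {{ }} operator from a reasonably recent dplyr/rlang, and the expressions are evaluated with base R eval() rather than dplyr's own machinery.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr.
partial_mutate <- function(x, colspec, ...) {
  caller <- parent.frame()
  # Capture the expressions passed in ... without evaluating them.
  exprs <- eval(substitute(alist(...)))
  # Select the subset; colspec may use any select() helper.
  sub <- select(x, {{ colspec }})
  # Evaluate each expression with `.` bound to the subset only;
  # everything else is looked up in the caller's environment.
  new_cols <- lapply(exprs, function(e) eval(e, list(. = sub), caller))
  cbind(x, as.data.frame(new_cols))
}

set.seed(15)
dd <- data.frame(id = letters[1:4], matrix(runif(5 * 4), nrow = 4))

out <- dd %>%
  partial_mutate(num_range("X", c(2, 4)),
                 evensum = rowSums(.), evenmean = rowMeans(.))

out
```

Because . is supplied directly to eval(), it no longer resolves to the piped-in data frame, which is what triggered the "'x' must be numeric" error in partial_mutate1.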
Not exactly what you are looking for, but if you want to do it inside a pipe you could use select explicitly inside mutate like this:
dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
# id X1 X2 X3 X4 X5 xy
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535
However, it is a bit more complicated if you want to apply several functions. You could use a helper function along the lines of (not thoroughly tested):
f <- function(x, ...) {
n <- nrow(x)
x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
matrix(unlist(x), nrow = n, byrow = FALSE)
}
And then apply it like this:
dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
# id X1 X2 X3 X4 X5 xy.1 xy.2
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592
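Note that the helper above stores both results in a single matrix column (printed as xy.1 and xy.2). As an alternative sketch, assuming dplyr 1.1.0 or later, a helper can return a data frame, which mutate() unpacks into separate named columns when the argument is unnamed. The name row_stats is hypothetical and not from the original answer.

```r
library(dplyr)

set.seed(15)
dd <- data.frame(id = letters[1:4], matrix(runif(5 * 4), nrow = 4))

# Hypothetical helper: several row-wise summaries of a data frame,
# returned as a data frame with one column per summary.
row_stats <- function(df) {
  data.frame(evensum = rowSums(df), evenmean = rowMeans(df))
}

# An unnamed data-frame result in mutate() is spliced in as new columns.
res2 <- dd %>%
  mutate(row_stats(pick(num_range("X", c(2, 4)))))

res2
```

This keeps the column selection in one place while still allowing any number of summary functions.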