Performing dplyr mutate on subset of columns

Question

I have a data.frame such as this (the real data set has many more rows and columns)

set.seed(15)
dd <- data.frame(id=letters[1:4], matrix(runif(5*4), nrow=4))

#   id        X1        X2        X3        X4        X5
# 1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437
# 2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670
# 3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871
# 4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125

I would like to be able to write a dplyr statement where I can select a subset of columns and mutate them. (I'm trying to do something similar to using .SDcols in data.table).

For a simplified example, here's the function I would like to be able to write to add columns for the sums and means of the even "X" columns while preserving all other columns. The desired output using base R is

(cols<-paste0("X", c(2,4)))
# [1] "X2" "X4"
cbind(dd,evensum=rowSums(dd[,cols]),evenmean=rowMeans(dd[,cols]))

#   id        X1        X2        X3        X4        X5   evensum  evenmean
# 1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
# 2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
# 3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
# 4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

But I want to use a dplyr-like chain to do the same thing. In the general case, I'd like to be able to use any of select()'s helper functions, such as starts_with, ends_with, and matches, together with any function. Here's what I tried:

library(dplyr)
partial_mutate1 <- function(x, colspec, ...) {
    select_(x, .dots=list(lazyeval::lazy(colspec))) %>% 
    transmute_(.dots=lazyeval::lazy_dots(...)) %>% 
    cbind(x,.)
}

dd %>% partial_mutate1(num_range("X", c(2,4)), 
    evensum=rowSums(.), evenmean=rowMeans(.))

However, this throws an error:

Error in rowSums(.) : 'x' must be numeric

This appears to happen because . refers to the entire data.frame rather than the selected subset (it is the same error that rowSums(dd) gives). However, note that this produces the desired output:

partial_mutate2 <- function(x, colspec) {
    select_(x, .dots=list(lazyeval::lazy(colspec))) %>% 
    transmute(evensum=rowSums(.), evenmean=rowMeans(.)) %>% 
    cbind(x,.)
}
dd %>% partial_mutate2(seq(2,ncol(dd),2))

I'm guessing this is some sort of environment problem? Any suggestions on how to pass the arguments to partial_mutate1 so that the . will correctly take values from the "select()-ed" dataset?
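One plausible reading of the error, sketched here without lazyeval: magrittr binds . to the left-hand side of the pipe, and an argument such as rowSums(.) is evaluated lazily in the caller's environment, so it still sees the original piped object even after the function has subsetted its own copy. The helper first_only below is hypothetical and exists only to illustrate that binding.

```r
library(magrittr)

# Hypothetical helper: it subsets its own copy of x, then forces `expr`.
# `expr` is a promise created at the call site, so any `.` inside it is
# looked up where the pipe bound it, not inside this function.
first_only <- function(x, expr) {
  x <- x[1]   # local change; invisible to the promise
  expr        # forced here, but evaluated in the caller's environment
}

res <- 1:3 %>% first_only(sum(.))
res
# [1] 6   (the sum of the whole vector, not of x[1])
```

By the same logic, the . in partial_mutate2 works because it appears in a pipe written inside the function, where the left-hand side is the already-selected data.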

Recommended answer

Am I missing something or would this work as expected:

cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
#  id        X1        X2        X3        X4        X5   evensum  evenmean
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

Or are you specifically looking for a custom function to do this?
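If a select() helper is wanted instead of a character vector, one option in more recent dplyr is across(), which accepts the same helpers inside mutate(). This is a sketch that assumes dplyr 1.0.0 or later, which postdates the code in the question; it is not part of the original answer.

```r
library(dplyr)

set.seed(15)
dd <- data.frame(id = letters[1:4], matrix(runif(5 * 4), nrow = 4))

# across() returns the selected columns as a data frame, so row-wise
# summaries such as rowSums() and rowMeans() can be applied to it directly.
out <- dd %>%
  mutate(
    evensum  = rowSums(across(num_range("X", c(2, 4)))),
    evenmean = rowMeans(across(num_range("X", c(2, 4))))
  )
out
```

All original columns are kept, and the selection is repeated once per new column.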

Not exactly what you are looking for but if you want to do it inside a pipe you could use select explicitly inside mutate like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
#  id        X1        X2        X3        X4        X5        xy
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535

However, it is a bit more complicated if you want to apply several functions. You could use a helper function along the lines of (..not thoroughly tested.. ):

f <- function(x, ...) {
  n <- nrow(x)
  x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
  matrix(unlist(x), nrow = n, byrow = FALSE)
}

Then apply it as follows:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
#  id        X1        X2        X3        X4        X5      xy.1      xy.2
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592
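For a reusable function closer to what the question asks for, one possible sketch is to pass the functions themselves rather than expressions containing ., which sidesteps the question of what . refers to. The name partial_mutate and its interface are assumptions made for illustration, and the {{ }} operator assumes a dplyr version built on rlang 0.4.0 or later.

```r
library(dplyr)

set.seed(15)
dd <- data.frame(id = letters[1:4], matrix(runif(5 * 4), nrow = 4))

# Hypothetical helper: `colspec` is any select() specification and `...`
# holds named functions, each applied to the selected columns as a whole.
partial_mutate <- function(x, colspec, ...) {
  fns <- list(...)
  sub <- select(x, {{ colspec }})   # forward the column spec to select()
  # Add one new column per function; unname() drops the row-name labels
  # that rowSums() and friends attach to their results.
  x[names(fns)] <- lapply(fns, function(f) unname(f(sub)))
  x
}

res <- dd %>%
  partial_mutate(num_range("X", c(2, 4)), evensum = rowSums, evenmean = rowMeans)
res
```

Functions that return a single value, such as max in the example above, would be recycled across rows by the data frame assignment.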
