Mutating multiple columns dynamically while conditioning on specific rows

Question

I know there are several similar questions around here, but none of them seems to address the precise issue I'm having.

set.seed(4)
df = data.frame(
  Key = c("A", "B", "A", "D", "A"),
  Val1 = rnorm(5),
  Val2 = runif(5),
  Val3 = 1:5
)

I want to zero out the value columns for the rows where Key == "A". The column names are referenced through a grep:

cols = grep("Val", names(df), value = TRUE)

Normally to achieve what I want in this case I would use data.table like this:

library(data.table)
df = as.data.table(df)
df[Key == "A", (cols) := 0]

And the desired output is like this:

  Key      Val1       Val2 Val3
1   A  0.000000 0.00000000    0
2   B -1.383814 0.55925762    2
3   A  0.000000 0.00000000    0
4   D  1.437151 0.05632773    4
5   A  0.000000 0.00000000    0

However this time I need to use dplyr as I am working on a team project where everyone uses it. The data I just provided is illustrative and my real data is >5m rows with 16 value columns to be updated. The only solution I could come up with is using mutate_at like this:

df %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(df$Key == "A", 0, x))

However, this seems to be extremely slow on my real data. I was hoping to find a solution which is more elegant and, more importantly, faster.

I have tried many combinations using map, unquoting using !!, using get and := (which annoyingly can get masked by the := in data.table) etc, but I think my understanding of how these work is simply not deep enough to construct a valid solution.

Recommended answer

With this dplyr command,

df %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(df$Key == "A", 0, x))

you are actually evaluating the statement df$Key == "A" n times, where n is the number of columns being modified.

One workaround is to pre-define the rows you want to change (DF here is the data frame, as in the benchmark further down):

idx = which(DF$Key=="A")
DF %>% mutate_at(.vars = vars(cols), .funs = function(x){x[idx]=0;x})
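To make the difference concrete, here is a minimal base R sketch on the question's toy data. It only illustrates the idea and is not part of the original answer: the helper names `zero_ifelse` and `zero_idx` are hypothetical, and `lapply` stands in for `mutate_at`. Both helpers produce the same columns, but the second reuses row positions that were computed once.

```r
# Toy data from the question
set.seed(4)
df <- data.frame(
  Key  = c("A", "B", "A", "D", "A"),
  Val1 = rnorm(5),
  Val2 = runif(5),
  Val3 = 1:5
)
cols <- grep("Val", names(df), value = TRUE)

# Hypothetical helper: re-evaluates df$Key == "A" for every column it is applied to
zero_ifelse <- function(x) ifelse(df$Key == "A", 0, x)

# Hypothetical helper: reuses row positions computed once, outside the function
idx <- which(df$Key == "A")
zero_idx <- function(x) { x[idx] <- 0; x }

res_ifelse <- lapply(df[cols], zero_ifelse)
res_idx    <- lapply(df[cols], zero_idx)

# Compare the two results column by column
print(all.equal(res_ifelse, res_idx))
```

On five rows the cost of the repeated comparison is negligible; it only becomes noticeable with millions of rows and many columns, as in the question.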

A cleaner and better way, correctly pointed out by @IceCreamToucan (see comments below), is to use the function replace, while passing it the extra parameters:

DF %>% mutate_at(.vars = vars(cols), replace, DF$Key == 'A', 0)
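In dplyr 1.0.0 and later, `mutate_at()` is superseded by `across()`. The following is a minimal sketch of the same `replace` idea in that style, assuming dplyr >= 1.0.0 is installed. It uses the question's toy `df` rather than the benchmark `DF`, and the variable names `is_a` and `out` are illustrative only.

```r
library(dplyr)

# Toy data from the question
set.seed(4)
df <- data.frame(
  Key  = c("A", "B", "A", "D", "A"),
  Val1 = rnorm(5),
  Val2 = runif(5),
  Val3 = 1:5
)
cols <- grep("Val", names(df), value = TRUE)

# Compute the row mask once, then reuse it for every selected column
is_a <- df$Key == "A"

# all_of() selects columns from a character vector of names
out <- df %>% mutate(across(all_of(cols), ~ replace(.x, is_a, 0)))
print(out)
```

Whether this is as fast as the `mutate_at` call above on large data has not been measured here; it is shown only as the current idiom for the same operation.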

We can put all these approaches to the test, and I think dplyr and data.table come out comparable.

# simulate data (the benchmarks need dplyr and data.table attached)
library(dplyr)
library(data.table)
set.seed(100)
Key = sample(LETTERS[1:3], 1000000, replace = TRUE)
DF = data.frame(Key, matrix(runif(1000000*10), nrow = 1000000, ncol = 10))
DT = as.data.table(DF)

cols = grep("[35789]", names(DF), value = TRUE)

#long method
system.time(DF %>% mutate_at(.vars = vars(cols), .funs = function(x) ifelse(DF$Key == "A", 0, x)))
   user  system elapsed 
  0.121   0.035   0.156 

# old base R way (note: this overwrites DF in place)
idx = which(DF$Key == "A")
system.time(DF[idx, cols] <- 0)
   user  system elapsed 
  0.085   0.021   0.106 

#dplyr
# define function
func = function(){
       idx = which(DF$Key=="A")
       DF %>% mutate_at(.vars = vars(cols), .funs = function(x){x[idx]=0;x})
}
system.time(func())
   user  system elapsed 
  0.020   0.006   0.026

#data.table
system.time(DT[Key=="A", (cols) := 0])
   user  system elapsed 
  0.012   0.001   0.013 

# replace with dplyr
system.time(DF %>% mutate_at(.vars = vars(cols), replace, DF$Key == 'A', 0))
   user  system elapsed 
  0.007   0.001   0.008
