dplyr mutate/replace several columns on a subset of rows

Problem description

I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times - this is a simplified example, but there are sometimes many assignments based on a single condition.

Thanks in advance for the help!

Recommended answer

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

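library(dplyr)  # provides mutate() and %>%

# Like mutate(), but only modifies the rows for which `condition` is TRUE.
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  # Evaluate the unquoted condition using the columns of .data
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}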

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
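
As a quick check of the behaviour described above, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are hypothetical and used only for illustration; the helper is repeated so the example runs on its own. It shows that only the matching rows change and that the input is left untouched.

```r
library(dplyr)

# Same helper as above, repeated so this example is self-contained.
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical four-row data frame, for illustration only.
toy <- data.frame(measure = c("cfl", "exit", "led", "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  cf = c(0.2, 0.4, 0.6, 0.8))

out <- toy %>% mutate_cond(measure == "exit", qty.exit = qty, cf = 0)

out$qty.exit  # 0 7 0 11   (rows 2 and 4 were updated)
out$cf        # 0.2 0.0 0.6 0.0
toy$qty.exit  # 0 0 0 0    (the input is not overwritten)
```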

1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

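mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  # Row numbers of the last group. group_rows() (dplyr >= 0.8.0) replaces
  # the older zero-based "indices" attribute, which no longer exists.
  indices <- group_rows(.data)[[n]]
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}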


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)
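
The reliance on TRUE sorting after FALSE can be inspected directly. The sketch below assumes dplyr 0.8.0 or later (for `group_keys()` and `group_rows()`) and uses small made-up data frames named `toy` and `none`. It also suggests a caveat: if no row satisfies the condition, the only group is the FALSE one, so the "last group" would cover every row.

```r
library(dplyr)

# Hypothetical data, for illustration only.
toy <- data.frame(measure = c("exit", "cfl", "exit", "led"))
g <- toy %>% group_by(is.exit = measure == "exit")

group_keys(g)$is.exit         # FALSE TRUE: the TRUE group comes last
group_rows(g)[[n_groups(g)]]  # 1 3: the rows where measure == "exit"

# Caveat: with no matching rows there is a single (FALSE) group,
# so the last group contains all rows.
none <- data.frame(measure = c("cfl", "led"))
g0 <- none %>% group_by(is.exit = measure == "exit")
n_groups(g0)                  # 1
group_keys(g0)$is.exit        # FALSE
```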

2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)
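
To see that the three idioms used above (ifelse, replace, and arithmetic with logicals) give the same result, here is a minimal base R sketch on made-up vectors; the values are hypothetical. Note that the arithmetic form only zeroes values out as written; setting another value needs the two-term form shown last.

```r
# Hypothetical condition and values, for illustration only.
is.exit <- c(FALSE, TRUE, FALSE, TRUE)
x <- c(0.2, 0.4, 0.6, 0.8)

a <- ifelse(is.exit, 0, x)   # pick 0 where TRUE, x elsewhere
b <- replace(x, is.exit, 0)  # overwrite the TRUE positions with 0
d <- (!is.exit) * x          # TRUE/FALSE act as 1/0 in arithmetic

a  # 0.2 0.0 0.6 0.0
all.equal(a, b) && all.equal(b, d)  # TRUE

# Setting a nonzero value (here 13) with arithmetic takes two terms.
e <- is.exit * 13 + (!is.exit) * x
e  # 0.2 13.0 0.6 13.0
```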

3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

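4) row_case_when Also check out row_case_when, defined in the question "Returning a tibble: how to vectorize with case_when?". It is not part of dplyr, so it must be defined as in that question before the code below will run. It uses a syntax similar to case_when but applies to rows.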

library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.
