How to use a range for columns instead of names for pmax / pmin

Question

I want to use a range of columns in pmax/pmin instead of typing names of all columns.

library(dplyr)

# sample data: 26 integer columns named a..z, 5 rows each
foo <- data.frame(sapply(letters, function(x) sample(1:10, 5)))

#this works
bar <- foo %>% 
    mutate(maxcol = pmax(a,b,c))

# this does not work
bar <- foo %>% 
    mutate(maxcol = pmax(a:z))

Ultimately I also want something like this

bar <- foo %>% 
    mutate_at(a:z = pmax(a:z))

Recommended answer

Here's an option that makes a single function call over all rows and all selected columns at once.

foo %>%
  mutate(maxcol = do.call(pmax, subset(., select = a:e)))
#    a  b c d e  f g  h  i j  k l m  n  o p q  r  s t u  v w  x  y z maxcol
# 1  1  4 9 2 4  4 1 10  2 3 10 4 7  1 10 9 8  2  8 9 5  1 9  1 10 9      9
# 2  5  2 5 3 5  2 8  8  5 8  2 3 6 10  9 3 5  8  7 4 6  9 8  5  8 3      5
# 3 10  9 6 1 7 10 6  4  4 7  6 6 2  7  5 5 4  1 10 7 3 10 5 10  1 7     10
# 4  8  1 4 8 9  3 3  9 10 1  8 5 8  4  4 8 6 10  5 2 9  5 7  7  3 1      9
# 5  2 10 2 9 8  9 9  6  7 5  9 2 5  5  7 4 2  5  4 8 4  6 6  2  9 6     10
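
Why this works: a data frame is a list of its columns, and do.call passes each element of that list to the function as a separate argument. A minimal base-R sketch of that equivalence, using a small made-up frame `df` (an assumption for illustration, not the `foo` data above):

```r
# Hypothetical 3-row frame, for illustration only
df <- data.frame(a = c(1, 5, 10), b = c(4, 2, 9), c = c(9, 5, 6))

# do.call() hands each column to pmax() as its own argument ...
all_at_once <- do.call(pmax, df)

# ... so it should match typing the columns out by hand
by_hand <- pmax(df$a, df$b, df$c)

print(all_at_once)
print(identical(all_at_once, by_hand))
```

subset(., select = a:e) simply narrows that list of columns before it is handed to pmax.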

You can select some or all of the columns using the colon notation, even arbitrary columns:

foo %>%
  mutate(maxcol = do.call(pmax, subset(., select = c(a:e,g))))
#    a  b c d e  f g  h  i j  k l m  n  o p q  r  s t u  v w  x  y z maxcol
# 1  1  4 9 2 4  4 1 10  2 3 10 4 7  1 10 9 8  2  8 9 5  1 9  1 10 9      9
# 2  5  2 5 3 5  2 8  8  5 8  2 3 6 10  9 3 5  8  7 4 6  9 8  5  8 3      8
# 3 10  9 6 1 7 10 6  4  4 7  6 6 2  7  5 5 4  1 10 7 3 10 5 10  1 7     10
# 4  8  1 4 8 9  3 3  9 10 1  8 5 8  4  4 8 6 10  5 2 9  5 7  7  3 1      9
# 5  2 10 2 9 8  9 9  6  7 5  9 2 5  5  7 4 2  5  4 8 4  6 6  2  9 6     10
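
The same pattern should carry over to pmin, which the question also asks about. The sketch below uses a made-up frame `df` with missing values (an assumption, not part of the original data) and shows one way to pass na.rm through do.call by appending it to the argument list:

```r
# Hypothetical frame with missing values, for illustration only
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), c = c(2, 8, 7))

# Same pattern, other direction: per-row minimum over columns a..c
min_strict <- do.call(pmin, subset(df, select = a:c))

# Appending na.rm = TRUE to the argument list ignores NAs within each row
min_lenient <- do.call(pmin, c(subset(df, select = a:c), na.rm = TRUE))

print(min_strict)
print(min_lenient)
```

c() on a data frame and a named value yields a plain list, so pmin receives the three columns plus na.rm = TRUE as a named argument.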

This should be preferred over the other answers (which generally use allegedly idiomatic methods) because:

  • in Dom's answer, the max function is called once for each row of the frame; R's vectorized operations are not being used, which is inefficient and should be avoided if possible;
  • in akrun's answer, pmax is called once for each column of the frame, which in this case might sound worse but is actually closer to the best one can do. My answer is closest to akrun's in that we are selecting data within the mutate.
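
The three strategies can be compared side by side in base R. This is only a sketch: it uses apply and Reduce as stand-ins for the dplyr/purrr code in the other answers, and a small 5-column frame `df` that is assumed for illustration.

```r
set.seed(42)
# Small stand-in for the frame in the question (assumed: 5 rows, columns a..e)
df <- data.frame(sapply(letters[1:5], function(x) sample(1:10, 5)))

# Row-wise: max() is called once per row, so the call count grows with nrow(df)
row_wise <- unname(apply(df, 1, max))

# Column-wise: pmax() is called once per additional column (4 calls for 5 columns)
col_wise <- Reduce(pmax, df)

# All at once: a single pmax() call that receives every column
all_at_once <- do.call(pmax, df)

# All three should agree; only the number of function calls differs
print(all(row_wise == col_wise) && all(col_wise == all_at_once))
```

The number of pmax calls in the last two variants depends only on the number of columns, not on the number of rows.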

If you'd prefer to use dplyr::select over base::subset, it needs to be broken out as

foo %>%
  mutate(maxcol = select(., a:e, g) %>% do.call(pmax, .))
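
If a recent dplyr is available, pick() may be a tidier way to write the same thing, since it selects columns from inside mutate without the `.` placeholder. This is a sketch that assumes dplyr 1.1.0 or later (where pick() was introduced); it is not part of the original answer.

```r
library(dplyr)

set.seed(42)
foo <- data.frame(sapply(letters, function(x) sample(1:10, 5)))

# pick() returns the selected columns as a data frame inside mutate(),
# so it can be passed straight to do.call() (assumes dplyr >= 1.1.0)
bar <- foo %>%
  mutate(maxcol = do.call(pmax, pick(a:e, g)))

print(bar$maxcol)
```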

I think this is demonstrated a little better with benchmarks. Using the provided 5x26 frame, we see a clear improvement:

library(dplyr)
library(purrr)  # for reduce()

set.seed(42)
foo <- data.frame(sapply(letters, function(x) sample(1:10, 5)))
microbenchmark::microbenchmark(
  Dom = {
    foo %>% 
      rowwise() %>% 
      summarise(max= max(c_across(a:z)))
  },
  akr = {
    foo %>%
       mutate(maxcol = reduce(select(., a:z), pmax))
  },
  r2 = {
    foo %>%
      mutate(maxcol = do.call(pmax, subset(., select = a:z)))
  }
)
# Unit: milliseconds
#  expr    min      lq    mean  median      uq     max neval
#   Dom 6.6561 7.15260 7.61574 7.38345 7.90375 11.0387   100
#   akr 4.2849 4.69920 4.96278 4.86110 5.18130  7.0908   100
#    r2 2.3290 2.49285 2.68671 2.59180 2.78960  4.7086   100

Let's try with a slightly larger 5000x26:

set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5000,replace=TRUE)))
microbenchmark::microbenchmark(
  Dom = {
    foo %>% 
      rowwise() %>% 
      summarise(max= max(c_across(a:z)))
  },
  akr = {
    foo %>%
       mutate(maxcol = reduce(select(., a:z), pmax))
  },
  r2 = {
    foo %>%
      mutate(maxcol = do.call(pmax, subset(., select = a:z)))
  }
)
# Unit: milliseconds
#  expr      min       lq      mean    median        uq       max neval
#   Dom 515.6437 563.6060 763.97348 811.45815 883.00115 1775.2366   100
#   akr   4.6660   5.1619  11.92847   5.74050   6.50625  293.7444   100
#    r2   2.9253   3.4371   4.24548   3.71845   4.27380   14.0958   100

This last one definitely shows a consequence of using rowwise. The relative performance between akrun's answer and this one is almost identical to the 5-row case, reinforcing the premise that column-wise is better than row-wise (and all-at-once is faster than both).

This can also be done with purrr::invoke if truly desired (note that invoke() has been deprecated since purrr 1.0.0), though it does not speed things up:

library(purrr)
foo %>%
  mutate(maxcol = invoke(pmax, subset(., select = a:z)))

### microbenchmark(...)
# Unit: milliseconds
#     expr    min      lq    mean  median      uq      max neval
#      Dom 7.8292 8.40275 9.02813 8.97345 9.38500  12.4368   100
#      akr 4.9622 5.28855 8.78909 5.60090 6.11790 309.2607   100
#   r2base 2.5521 2.74635 3.01949 2.90415 3.21060   4.6512   100
#  r2purrr 2.5063 2.77510 3.11206 2.93415 3.33015   5.2403   100
