How to use a range for columns instead of names for pmax / pmin
Problem description
I want to use a range of columns in pmax/pmin instead of typing names of all columns.
library(dplyr)
# sample data
foo <- data.frame(sapply(letters, function(x) sample(1:10, 5)))
#this works
bar <- foo %>%
mutate(maxcol = pmax(a,b,c))
# this does not work
bar <- foo %>%
mutate(maxcol = pmax(a:z))
Ultimately I also want something like this
bar <- foo %>%
mutate_at(a:z = pmax(a:z))
Recommended answer
Here's an option that does one function-call on all rows, all columns at once.
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:e)))
# a b c d e f g h i j k l m n o p q r s t u v w x y z maxcol
# 1 1 4 9 2 4 4 1 10 2 3 10 4 7 1 10 9 8 2 8 9 5 1 9 1 10 9 9
# 2 5 2 5 3 5 2 8 8 5 8 2 3 6 10 9 3 5 8 7 4 6 9 8 5 8 3 5
# 3 10 9 6 1 7 10 6 4 4 7 6 6 2 7 5 5 4 1 10 7 3 10 5 10 1 7 10
# 4 8 1 4 8 9 3 3 9 10 1 8 5 8 4 4 8 6 10 5 2 9 5 7 7 3 1 9
# 5 2 10 2 9 8 9 9 6 7 5 9 2 5 5 7 4 2 5 4 8 4 6 6 2 9 6 10
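This works because a data frame is a list of columns, so do.call(pmax, df) expands to a single pmax call with every column as an argument. A minimal base-R sketch, assuming a small made-up frame (the name small and its values are illustrative, not the question's data):

```r
# Illustrative data; not the question's random sample.
small <- data.frame(a = c(1, 5, 10), b = c(4, 2, 9), c = c(9, 5, 6))

# Spelling every column out by hand ...
by_name <- pmax(small$a, small$b, small$c)

# ... gives the same result as passing the columns as an argument list.
# subset(select = a:c) picks the column range without quoting names.
by_range <- do.call(pmax, subset(small, select = a:c))

print(by_range)
```

Both vectors hold the per-row maximum, so the range form can replace the hand-typed list of names.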
You can select some or all of the columns using the colon notation, even arbitrary columns:
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = c(a:e,g))))
# a b c d e f g h i j k l m n o p q r s t u v w x y z maxcol
# 1 1 4 9 2 4 4 1 10 2 3 10 4 7 1 10 9 8 2 8 9 5 1 9 1 10 9 9
# 2 5 2 5 3 5 2 8 8 5 8 2 3 6 10 9 3 5 8 7 4 6 9 8 5 8 3 8
# 3 10 9 6 1 7 10 6 4 4 7 6 6 2 7 5 5 4 1 10 7 3 10 5 10 1 7 10
# 4 8 1 4 8 9 3 3 9 10 1 8 5 8 4 4 8 6 10 5 2 9 5 7 7 3 1 9
# 5 2 10 2 9 8 9 9 6 7 5 9 2 5 5 7 4 2 5 4 8 4 6 6 2 9 6 10
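The same pattern should carry over to pmin, and extra arguments such as na.rm can be appended to the argument list. A small sketch under those assumptions (the frame small and the variable names are hypothetical, chosen for illustration):

```r
# Illustrative data with one missing value in row 2.
small <- data.frame(a = c(1, NA, 10), b = c(4, 2, 9), c = c(9, 5, 6))
cols <- subset(small, select = a:c)

# Row-wise minimum over the same range of columns.
# With the default na.rm = FALSE, row 2 comes back as NA.
row_min <- do.call(pmin, cols)

# Extra arguments are appended to the argument list; na.rm = TRUE
# makes pmax ignore the missing value in row 2.
row_max <- do.call(pmax, c(cols, list(na.rm = TRUE)))

print(row_min)
print(row_max)
```

Wrapping na.rm in list() keeps the argument list a plain named list, which is what do.call expects.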
This should be preferred over the other answers (which generally use allegedly idiomatic methods) because:
- In Dom's answer, the max function is called once for each row of the frame. R's vectorized operations are not used, which is inefficient and should be avoided where possible.
- In akrun's answer, pmax is called once for each column of the frame, which in this case might sound worse but is actually closer to the best one can do. My answer is closest to akrun's in that we are selecting data within the mutate.
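The three strategies compared here can be sketched in base R to show that they agree on the result and differ only in how many function calls they make. This is an illustration, not the other answers' actual code: apply stands in for rowwise, and base Reduce stands in for purrr's reduce.

```r
set.seed(42)
df <- data.frame(a = sample(1:10, 5), b = sample(1:10, 5), c = sample(1:10, 5))

# Row-wise: max() is called once per row (5 calls here, n calls in general).
per_row <- unname(apply(df, 1, max))

# Column-wise: pmax() is called once per additional column (ncol - 1 calls).
per_col <- Reduce(pmax, df)

# All at once: a single pmax() call receives every column.
at_once <- do.call(pmax, df)

print(all(per_row == per_col) && all(per_col == at_once))
```

The number of calls in the row-wise version grows with the number of rows, while the other two depend only on the number of columns.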
If you'd prefer to use dplyr::select over base::subset, it needs to be broken out as
foo %>%
mutate(maxcol = select(., a:e, g) %>% do.call(pmax, .))
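As a possible variant that avoids the dot placeholder altogether, recent dplyr versions provide pick(), which selects columns from the current data inside mutate. This is not part of the original answer; the sketch assumes dplyr 1.1.0 or later and uses a made-up frame named small:

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() was introduced

# Illustrative data; not the question's random sample.
small <- data.frame(a = c(1, 5, 10), b = c(4, 2, 9), g = c(9, 8, 6))

# pick() returns the selected columns as a data frame (a list of columns),
# which do.call() then hands to a single pmax() call.
res <- small %>%
  mutate(maxcol = do.call(pmax, pick(a:b, g)))

print(res)
```

Because pick() does not rely on the magrittr dot, the same mutate call should also work with the native |> pipe.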
I think this is demonstrated a little better with benchmarks. Using the provided 5x26 frame, we see a clear improvement:
set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5)))
microbenchmark::microbenchmark(
Dom = {
foo %>%
rowwise() %>%
summarise(max= max(c_across(a:z)))
},
akr = {
foo %>%
mutate(maxcol = reduce(select(., a:z), pmax))
},
r2 = {
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:z)))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 6.6561 7.15260 7.61574 7.38345 7.90375 11.0387 100
# akr 4.2849 4.69920 4.96278 4.86110 5.18130 7.0908 100
# r2 2.3290 2.49285 2.68671 2.59180 2.78960 4.7086 100
Let's try with a slightly larger 5000x26:
set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5000,replace=TRUE)))
microbenchmark::microbenchmark(
Dom = {
foo %>%
rowwise() %>%
summarise(max= max(c_across(a:z)))
},
akr = {
foo %>%
mutate(maxcol = reduce(select(., a:z), pmax))
},
r2 = {
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:z)))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 515.6437 563.6060 763.97348 811.45815 883.00115 1775.2366 100
# akr 4.6660 5.1619 11.92847 5.74050 6.50625 293.7444 100
# r2 2.9253 3.4371 4.24548 3.71845 4.27380 14.0958 100
This last one clearly shows the cost of using rowwise. The relative performance between akrun's answer and this one is almost the same as with 5 rows, reinforcing the premise that column-wise is better than row-wise (and all-at-once is faster than both).
This can also be done with purrr::invoke, if truly desired, though it does not speed it up:
library(purrr)
foo %>%
mutate(maxcol = invoke(pmax, subset(., select = a:z)))
### microbenchmark(...)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 7.8292 8.40275 9.02813 8.97345 9.38500 12.4368 100
# akr 4.9622 5.28855 8.78909 5.60090 6.11790 309.2607 100
# r2base 2.5521 2.74635 3.01949 2.90415 3.21060 4.6512 100
# r2purrr 2.5063 2.77510 3.11206 2.93415 3.33015 5.2403 100