Translating filter_all(any_vars()) to filter(across())
Question
While updating my own answer to another thread, I wasn't able to come up with a good solution to replace the last example (see below). The idea is to get all rows where any column contains a certain string, in my example "V".
library(tidyverse)
#get all rows where any column contains 'V'
diamonds %>%
filter_all(any_vars(grepl('V',.))) %>%
head
#> # A tibble: 6 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#> 3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> 4 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#> 5 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#> 6 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
# this does naturally not give the desired output!
diamonds %>%
filter(across(everything(), ~ grepl('V', .))) %>%
head
#> # A tibble: 0 x 10
I found a post where the poster was pondering something similar:
### don't run, this is ugly and does not work
diamonds %>%
rowwise %>%
filter(any(grepl("V", across(everything())))) %>%
head
Recommended answer
This is tricky, because the example shows that you want to keep a row when any of its columns meets the condition (i.e. you want a union). That is what filter_all() combined with any_vars() does.
In contrast, filter(across(everything(), ...)) keeps a row only when all of its columns meet the condition (i.e. this is an intersection, the opposite of the previous case).
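To make the union versus intersection distinction concrete, here is a minimal base R sketch. The data frame `toy` and its column names are hypothetical and not part of the original post; the per-column tests are combined by hand so that the two behaviours can be compared side by side.

```r
# Hypothetical toy data, not from the original post
toy <- data.frame(
  a = c("V1", "x", "V2"),
  b = c("V3", "y", "z"),
  stringsAsFactors = FALSE
)

# One logical vector per column: TRUE where the value contains "V"
hits <- lapply(toy, function(col) grepl("V", col))

# Intersection: a row qualifies only if every column matches
all_match <- Reduce(`&`, hits)

# Union: a row qualifies if at least one column matches
any_match <- Reduce(`|`, hits)

toy[all_match, ]  # only row 1
toy[any_match, ]  # rows 1 and 3
```

Row 3 is the interesting one: only column `a` contains "V", so it survives the union but not the intersection.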
To convert it from an intersection to a union (i.e. to get the rows where any of the columns meets the condition), you probably need to check the row sums:
diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
This sums all the TRUEs that appear in each row, i.e. if at least one value meets the condition, that row's sum will be > 0 and the row will be kept.
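The row-sum idea can be traced step by step on a small example. This is only a sketch in base R on hypothetical data (`toy`, `grade` and `note` are made-up names); it builds the logical matrix with sapply() rather than across(), but the counting logic is the same.

```r
# Hypothetical toy data, not from the original post
toy <- data.frame(
  grade = c("VS1", "SI1", "I1"),
  note  = c("Very Good", "Very Good", "Fair"),
  stringsAsFactors = FALSE
)

# Logical matrix: one column per input column, TRUE where the value contains "V"
hits <- sapply(toy, function(col) grepl("V", col))

# TRUE counts as 1, so each row sum is the number of matching columns
n_hits <- rowSums(hits)

# Keep rows with at least one match
keep <- n_hits > 0
toy[keep, ]
```

Here the row sums are 2, 1 and 0, so the first two rows are kept and the third is dropped.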
I'm sorry that across() is not the very first child of filter(), but this at least gives some idea of how to do it. :-)
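On newer dplyr versions there is a more direct route. The sketch below assumes dplyr 1.0.4 or later, which postdates this answer and provides if_any() and if_all() for use inside filter(). The tibble `toy` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not from the original post
toy <- tibble(
  id    = c("a", "b", "c"),
  grade = c("VS1", "SI1", "I1"),
  note  = c("ok", "Very Good", "Fair")
)

# Union: keep rows where ANY column contains "V" (assumes dplyr >= 1.0.4)
any_v <- toy %>%
  filter(if_any(everything(), ~ grepl("V", .x)))

# Intersection: keep rows where ALL columns contain "V"
all_v <- toy %>%
  filter(if_all(everything(), ~ grepl("V", .x)))
```

With this data, `any_v` holds rows "a" and "b", while `all_v` is empty because no row has "V" in every column. On the diamonds data the same pattern would presumably be filter(if_any(everything(), ~ grepl("V", .x))), though that was not part of the comparison in this answer.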
Evaluation:
Using @TimTeaFan's method to check that both approaches return the same result:
identical(
{diamonds %>%
filter_all(any_vars(grepl('V',.)))
},
{diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
}
)
#> [1] TRUE
Benchmark:
As per our discussion under TimTeaFan's answer, here is a comparison. Surprisingly, all solutions take a similar amount of time:
library(tidyverse)
microbenchmark::microbenchmark(
filter_all = {diamonds %>%
filter_all(any_vars(grepl('V',.)))},
purrr_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))},
base_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% Reduce(`|`, .))},
rowsums = {diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)},
times = 100L,
check = "identical"
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval
#>    filter_all 295.7235 302.1311 309.6455 305.0491 310.0335 449.3619   100
#>  purrr_reduce 297.8220 302.4411 310.2829 306.2929 312.2278 461.0194   100
#>   base_reduce 298.5033 303.6170 309.4147 306.1839 312.3518 409.5273   100
#>       rowsums 295.3863 301.0281 307.8517 305.3142 309.4793 372.8867   100
Created on 2020-07-14 by the reprex package (v0.3.0)