dplyr tidyeval equivalent of underscore function versions

Question

Recent versions of dplyr deprecate the underscore versions of functions, such as filter_, in favour of tidy evaluation.

What is the expected replacement for the underscore forms under the new approach? How do I write it so that R CMD check does not complain about undefined symbols?

library(dplyr)

df <- data_frame(id = rep(c("a","b"), 3), val = 1:6)
df %>% filter_(~id == "a")

# want to avoid this, because it references column id in a variable-style
df %>% filter( id == "a" )

# option A
df %>% filter( UQ(rlang::sym("id")) == "a" )
# option B
df %>% filter( UQ(as.name("id")) == "a" )
# option C
df %>% filter( .data$id == "a" )

Is there a preferred or more considered form? Option C is the shortest, but it is slower on some of my larger real-world datasets and more complex dplyr constructs:

microbenchmark(
sym = dsPClosest %>%
  group_by(!!sym(dateVarName), !!sym("depth")) %>%
  summarise(temperature = mean(!!sym("temperature"), na.rm = TRUE)
            , moisture = mean(!!sym("moisture"), na.rm = TRUE)) %>%
  ungroup()
,data = dsPClosest %>%
    group_by(!!sym(dateVarName), .data$depth ) %>%
    summarise(temperature = mean(.data$temperature , na.rm = TRUE)
              , moisture = mean(.data$moisture , na.rm = TRUE)) %>%
    ungroup()  
,times=10
)
#Unit: milliseconds
# expr        min         lq      mean     median        uq       max neval
#  sym   80.05512   84.97267  122.7513   94.79805  100.9679  392.1375    10
# data 4652.83104 4741.99165 5371.5448 5039.63307 5471.9261 7926.7648    10

There is another answer for mutate_ that uses even more complex syntax.

Recommended answer

Based on your comment, I guess it would be:

df %>% filter(!!as.name("id") == "a")

rlang is unnecessary here, as you can do this with !! and as.name instead of UQ and sym.
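As a minimal sketch of how this pattern might be reused, the helper below takes the column name as a string, which keeps bare column names out of the code. The name filter_by_name and the toy tibble are hypothetical and only serve as illustration.

```r
library(dplyr)

# Hypothetical helper: filter rows where the column named `col_name` equals `value`.
filter_by_name <- function(data, col_name, value) {
  col <- as.name(col_name)          # turn the string into a symbol
  data %>% filter(!!col == value)   # unquote the symbol inside filter()
}

# Toy data mirroring the question's example
toy <- tibble(id = rep(c("a", "b"), 3), val = 1:6)
res <- filter_by_name(toy, "id", "a")
print(res)
```

Because the column is referred to only through a string, no bare column name appears in the function body.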

But maybe a better option is a scoped filter, which avoids quosure-related issues:

df %>% filter_at(vars("id"), all_vars(. == "a"))

In the code above, vars() determines which columns the filtering statement is applied to (in the help for filter_at, the filtering statement is called the "predicate"). In this case, vars("id") means the filtering statement is applied only to the id column. The filtering statement can be either an all_vars() or an any_vars() statement, though with a single column the two are equivalent. all_vars(. == "a") means that all of the columns in vars("id") must equal "a". Yes, it's a bit confusing.
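The difference between all_vars() and any_vars() only shows up once more than one column is selected. The sketch below uses a made-up two-column tibble (toy, x and y are illustrative names, not from the question) to show how the two predicates might differ.

```r
library(dplyr)

# Made-up data: two character columns
toy <- tibble(x = c("a", "a", "b"),
              y = c("a", "b", "b"))

# all_vars(): every selected column must equal "a"
all_rows <- toy %>% filter_at(vars("x", "y"), all_vars(. == "a"))

# any_vars(): at least one selected column must equal "a"
any_rows <- toy %>% filter_at(vars("x", "y"), any_vars(. == "a"))

print(all_rows)
print(any_rows)
```

Here all_vars() keeps only the row where both x and y are "a", while any_vars() also keeps the row where just x is "a".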

Timings for data similar to your example. In this case, we use group_by_at and summarise_at, which are the scoped versions of group_by and summarise:

library(microbenchmark)

set.seed(2)
df <- data_frame(group = sample(1:100,1e4*52,replace=TRUE), 
                 id = rep(c(letters,LETTERS), 1e4), 
                 val = sample(1:50,1e4*52,replace=TRUE))

microbenchmark(
quosure=df %>% group_by(!!as.name("group"), !!as.name("id")) %>% 
  summarise(val = mean(!!as.name("val"))),
data=df %>% group_by(.data$group, .data$id) %>% 
  summarise(val = mean(.data$val)),
scoped_group_by = df %>% group_by_at(vars("group","id")) %>% 
  summarise_at("val", mean), times=10)

Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval cld
         quosure  59.29157  61.03928  64.39405  62.60126  67.93810  72.47615    10  a 
            data 391.22784 394.65636 419.24201 413.74683 425.11709 498.42660    10   b
 scoped_group_by  69.57573  71.21068  78.26388  76.67216  82.89914  91.45061    10  a

Original answer

I think this is a case where you would pass the filter variable as a bare name and then use enquo and !! (the equivalent of UQ) to refer to it. For example:

library(dplyr)

fnc = function(data, filter_var, filter_value) {
  filter_var=enquo(filter_var)
  data %>% filter(!!filter_var == filter_value)
}

fnc(df, id, "a")

     id   val
1     a     1
2     a     3
3     a     5

fnc(mtcars, carb, 3)

   mpg cyl  disp  hp drat   wt qsec vs am gear carb 
1 16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3 
2 17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3 
3 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3 
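The same enquo and !! pattern should carry over to grouping and summarising. The following is only a sketch under that assumption; the function name mean_by and the output column mean_value are hypothetical.

```r
library(dplyr)

# Hypothetical helper: mean of `value_var` within each level of `group_var`,
# with both columns passed as bare names.
mean_by <- function(data, group_var, value_var) {
  group_var <- enquo(group_var)   # capture the bare column names
  value_var <- enquo(value_var)
  data %>%
    group_by(!!group_var) %>%
    summarise(mean_value = mean(!!value_var, na.rm = TRUE))
}

res <- mean_by(mtcars, cyl, mpg)
print(res)
```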
