R dplyr: operate on a column known only by its string name
Question
I am wrestling with programming using dplyr in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr to support quosures and the like, and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.
My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function, or even in a script where the column name may change between runs, because I can't hard-code the unquoted (i.e., "bare") column name. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.
For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.
Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the code below), and I want to operate on this column later. Because the column name is generic, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.
library(dplyr)  # provides %>% and the verbs used below

userInput_prob1 <- 0.95
userInput_prob2 <- 0.9

# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){

  # Define new column name to hold the MPG percentile cutoff.
  probColName <- paste0('P', pctCutoff*100)

  # Compute the MPG percentile cutoff by number of gears.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by( gear ) %>%
    dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

  # Filter mtcars with only MPG values above cutoffs.
  output <- mtcars %>%
    dplyr::left_join( MPGCutoffs, by='gear' ) %>%
    dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck

  # Return filtered data.
  return(output)
}

best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )
The dplyr::filter() statement is what I can't get to run properly. I've tried:
dplyr::filter( mpg > probColName ) - No error, but no rows returned.
dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.
I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.
Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.
Recommended answer
If you have a column name in a string (aka character vector) and you want to use it with tidyeval, then you can convert it with rlang::sym(). Just change the filter line to
dplyr::filter( mpg > !!rlang::sym(probColName) )
and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116
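Putting that change into the function from the question gives a version that can be run end to end. This is a minimal sketch, assuming dplyr 0.7 or later with rlang installed; everything except the filter() line comes from the question, and the function returns the pipeline result directly instead of going through an intermediate variable.

```r
library(dplyr)

getBestMPG <- function(pctCutoff) {
  # Column name built from the input, e.g. "P95" for 0.95.
  probColName <- paste0("P", pctCutoff * 100)

  # MPG percentile cutoff by number of gears.
  # A plain string is enough on the left-hand side of :=.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by(gear) %>%
    dplyr::summarize(!!probColName := quantile(mpg, pctCutoff))

  # Convert the string to a symbol so filter() looks up the column
  # instead of comparing mpg against the text "P95".
  mtcars %>%
    dplyr::left_join(MPGCutoffs, by = "gear") %>%
    dplyr::filter(mpg > !!rlang::sym(probColName))
}

best_1 <- getBestMPG(0.95)
best_2 <- getBestMPG(0.9)
```

Each result keeps the dynamically named cutoff column (P95 or P90) next to the original mtcars columns, so the filter can be checked by eye.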
It's still fine to use
dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
because when dynamically setting a parameter name, you just need the string and not an unquoted symbol.
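The difference between the two sides can be seen in a small sketch. On the right-hand side, unquoting the bare string inlines a character constant, so an expression like mpg > !!probColName likely ends up comparing numbers against the text "P95" rather than against the column, which would be consistent with the "no rows" result in the question. The toy data frame df and the name colName below are made up for illustration, and the .data pronoun in the last step is an alternative that the answer does not mention.

```r
library(dplyr)

df <- data.frame(x = c(1, 5, 10))  # hypothetical toy data
colName <- "limit"                 # hypothetical column name, known only as a string

# Left of := a string is enough to name the new column.
df2 <- df %>% dplyr::mutate(!!colName := 4)

# On the right-hand side the string must become a symbol
# before it can refer to the column.
viaSym <- df2 %>% dplyr::filter(x > !!rlang::sym(colName))

# Alternative (not from the answer): the .data pronoun
# accepts the string directly via [[ ]].
viaData <- df2 %>% dplyr::filter(x > .data[[colName]])
```

Both filters keep the rows where x exceeds the value stored in the limit column.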