R dplyr operate on a column known only by its string name


Question

I am wrestling with programming using dplyr in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr to support quosures and the like and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.

My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.

For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.

Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.

library(dplyr)  # provides the %>% pipe used below

userInput_prob1 <- 0.95
userInput_prob2 <- 0.9

# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){

  # Define new column name to hold the MPG percentile cutoff.
  probColName <- paste0('P', pctCutoff*100)

  # Compute the MPG percentile cutoff by number of gears.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by( gear ) %>%
    dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

  # Filter mtcars with only MPG values above cutoffs.
  output <- mtcars %>%
    dplyr::left_join( MPGCutoffs, by='gear' ) %>%
    dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck

  # Return filtered data.
  return(output)
}

best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )

The dplyr::filter() statement is what I can't get to run properly. I've tried:

dplyr::filter( mpg > probColName ) - No error, but no rows returned.

dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.

I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.

Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.

Recommended answer

If you have a column name in a string (aka character vector) and you want to use it with tidyeval, then you can convert it to a symbol with rlang::sym(). Just change the filter line to

dplyr::filter( mpg > !!rlang::sym(probColName) )

and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116
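Applied to the function from the question, the change might look like the sketch below. It assumes the dplyr and rlang packages are installed and uses the built-in mtcars data; apart from the filter() line, it is the question's own code, lightly condensed.

```r
library(dplyr)

getBestMPG <- function(pctCutoff) {
  # Column name built from the user-supplied percentile, e.g. "P95".
  probColName <- paste0("P", pctCutoff * 100)

  # MPG percentile cutoff by number of gears, stored under the dynamic name.
  MPGCutoffs <- mtcars %>%
    group_by(gear) %>%
    summarize(!!probColName := quantile(mpg, pctCutoff))

  # sym() turns the string into a symbol; !! unquotes it so that
  # filter() looks up the column instead of comparing against a string.
  mtcars %>%
    left_join(MPGCutoffs, by = "gear") %>%
    filter(mpg > !!rlang::sym(probColName))
}

best_1 <- getBestMPG(0.95)
best_2 <- getBestMPG(0.9)
```

Each call creates and filters on its own column (P95 for the first, P90 for the second), so no column name has to be hard-coded outside the function.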

You can still use

dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )

because when dynamically setting a parameter name, you just need the string and not an unquoted symbol.
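As an alternative that is not part of the original answer, the `.data` pronoun that dplyr re-exports from rlang also accepts a string when subset with `[[`. A minimal sketch, assuming a reasonably recent dplyr and a hypothetical fixed name "P90" standing in for the computed one:

```r
library(dplyr)

# Hypothetical column name; in the question it is built from pctCutoff.
probColName <- "P90"

cutoffs <- mtcars %>%
  group_by(gear) %>%
  summarize(!!probColName := quantile(mpg, 0.9))

# .data[[...]] looks the column up by its string name inside the data frame.
best <- mtcars %>%
  left_join(cutoffs, by = "gear") %>%
  filter(mpg > .data[[probColName]])
```

Both forms refer to the same column; `.data[[probColName]]` skips the explicit string-to-symbol conversion.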
