dplyr用动态列进行汇总 [英] dplyr summarise with dynamic columns

查看:81
本文介绍了dplyr用动态列进行汇总的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试对我的postgres数据库使用dplyr,并且正在执行一个简单的功能。
如果我直接解析列名,一切都会起作用,但是我想动态地做到这一点(即,从另一个数据框中对每个列名进行排序

I'm trying to use dplyr against my postgres database and am conducting a simple function. Everything works if I parse the column name directly, however I want to do this dynamically (i.e. sort through each column name from another dataframe

我的问题是m geeting用于前两个计算,我得到正确的结果

The problem I'm geeting is for the first two calculations, i'm getting the right results

假设第一个动态列称为 id

Assume the first dynamic column is called "id"

pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(var) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)), 
        maxvalue = max(var), 
        minvalue = min(var), 
        maxlength = max(length(var)), 
        minlen = min(length(var))
    )  %>% 
    show_query()

当您看到sql时,我得到的错误结果很明显-有时id周围有'',因此它作为字符串进行计算:

The wrong result I get is obvious when you see the sql - sometimes id has '' around it so it's calculating as a string:

<SQL>
SELECT 
    COUNT(*) AS "row_count", 
    COUNT(DISTINCT id) AS "distinct_count", 
    COUNT(
        DISTINCT CASE 
            WHEN ('id' = '') THEN (NULL) 
            WHEN NOT('id' = '') THEN ('id') 
        END) AS "distinct_count_minus_blank", 
    MAX('id') AS "maxvalue", 
    MIN('id') AS "minvalue", 
    MAX(LENGTH('id')) AS "maxlength", 
    MIN(LENGTH('id')) AS "minlen"
FROM "table"

您可以从此输出中看到,有时在列上进行计算,但有时只在字符串 id上进行。为什么会这样,如何解决它,以便它在实际列而不是字符串上进行计算?

You can see from this output that sometimes the calculation is happening on the column, but sometimes it's just happening on the string "id". Why is this and how can I fix it so it calculates on the actual column rather than the string?

推荐答案

我认为您应该看看 rlang :: sym (由 dplyr 导入)。

I think you should look at rlang::sym (which is imported by dplyr).

假设 pull_table 是一个包含 id some_numeric_variable some_character_variable 列,您可以这样写:

Assuming pull_table is a dataframe including id, some_numeric_variable and some_character_variable columns, you could write something like this:

xx = sym("id")
yy = sym("some_numeric_variable")
ww = sym("some_character_variable")
pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(!!xx) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="", NA, !!xx)), 
        maxvalue = max(!!yy ), 
        minvalue = min(!!yy ), 
        maxlength = max(length(!!ww)), 
        minlen = min(length(!!ww))
    )

sym()函数将字符串变量转换为名称,可以在dplyr函数中使用<$ c $取消引用c> !! 运算符。如果您需要更多信息,请查看准引用文档或< a href = https://cran.r-project.org/package=dplyr/vignettes/programming.html rel = nofollow noreferrer>此教程。

The sym() function turn a string variable into a name, which can be unquoted inside dplyr functions with the !! operator. If you want more information, please take a look at the quasiquotation doc or this tuto.

不幸的是,由于我手边没有任何 tbl_sql ,因此无法使用 show_query

Unfortunately, since I didn't have any tbl_sql at hand, I couldn't test it with show_query.

侧面建议:请不要将变量命名为 var,因为 var 也是方差功能。我把头发扯了很多遍,只是因为它弄乱了一些程序包或自定义功能。

Side advice: don't ever name your variables "var" as var is also the variance function. I pulled my hair off many times just because this had messed up with some packages or custom functions.

这篇关于dplyr用动态列进行汇总的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆