dplyr用动态列进行汇总 [英] dplyr summarise with dynamic columns
问题描述
我正在尝试对我的postgres数据库使用dplyr,并且正在执行一个简单的功能。
如果我直接解析列名,一切都会起作用,但是我想动态地做到这一点(即,从另一个数据框中对每个列名进行排序
I'm trying to use dplyr against my postgres database and am conducting a simple function. Everything works if I parse the column name directly, however I want to do this dynamically (i.e. sort through each column name from another dataframe
我的问题是m geeting用于前两个计算,我得到正确的结果
The problem I'm geeting is for the first two calculations, i'm getting the right results
假设第一个动态列称为 id
Assume the first dynamic column is called "id"
pull_table %>%
summarise(
row_count = n(),
distinct_count = n_distinct(var) ,
distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)),
maxvalue = max(var),
minvalue = min(var),
maxlength = max(length(var)),
minlen = min(length(var))
) %>%
show_query()
当您看到sql时,我得到的错误结果很明显-有时id周围有'',因此它作为字符串进行计算:
The wrong result I get is obvious when you see the sql - sometimes id has '' around it so it's calculating as a string:
<SQL>
SELECT
COUNT(*) AS "row_count",
COUNT(DISTINCT id) AS "distinct_count",
COUNT(
DISTINCT CASE
WHEN ('id' = '') THEN (NULL)
WHEN NOT('id' = '') THEN ('id')
END) AS "distinct_count_minus_blank",
MAX('id') AS "maxvalue",
MIN('id') AS "minvalue",
MAX(LENGTH('id')) AS "maxlength",
MIN(LENGTH('id')) AS "minlen"
FROM "table"
您可以从此输出中看到,有时在列上进行计算,但有时只在字符串 id上进行。为什么会这样,如何解决它,以便它在实际列而不是字符串上进行计算?
You can see from this output that sometimes the calculation is happening on the column, but sometimes it's just happening on the string "id". Why is this and how can I fix it so it calculates on the actual column rather than the string?
推荐答案
我认为您应该看看 rlang :: sym
(由 dplyr
导入)。
I think you should look at rlang::sym
(which is imported by dplyr
).
假设 pull_table
是一个包含 id
, some_numeric_variable $ c的数据框$ c>和
some_character_variable
列,您可以这样写:
Assuming pull_table
is a dataframe including id
, some_numeric_variable
and some_character_variable
columns, you could write something like this:
xx = sym("id")
yy = sym("some_numeric_variable")
ww = sym("some_character_variable")
pull_table %>%
summarise(
row_count = n(),
distinct_count = n_distinct(!!xx) ,
distinct_count_minus_blank = n_distinct(ifelse(var=="", NA, !!xx)),
maxvalue = max(!!yy ),
minvalue = min(!!yy ),
maxlength = max(length(!!ww)),
minlen = min(length(!!ww))
)
sym()
函数将字符串
变量转换为名称
,可以在dplyr函数中使用<$ c $取消引用c> !! 运算符。如果您需要更多信息,请查看准引用文档或< a href = https://cran.r-project.org/package=dplyr/vignettes/programming.html rel = nofollow noreferrer>此教程。
The sym()
function turn a string
variable into a name
, which can be unquoted inside dplyr functions with the !!
operator. If you want more information, please take a look at the quasiquotation doc or this tuto.
不幸的是,由于我手边没有任何 tbl_sql
,因此无法使用 show_query $ c $进行测试c>。
Unfortunately, since I didn't have any tbl_sql
at hand, I couldn't test it with show_query
.
侧面建议:请不要将变量命名为 var,因为 var
也是方差功能。我把头发扯了很多遍,只是因为它弄乱了一些程序包或自定义功能。
Side advice: don't ever name your variables "var" as var
is also the variance function. I pulled my hair off many times just because this had messed up with some packages or custom functions.
这篇关于dplyr用动态列进行汇总的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!