How to pipe SQL into R's dplyr?

Question

I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct() but it's not supported in SQL syntax. Anyways, this does indeed work:

dbGetQuery(database_name, 
           "SELECT t.* 
           FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM 
           FROM table_name t
           ) t 
           WHERE SEQNUM = 1;")

I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:

distinct.df <- 
  left_join(sql_table_1, sql_table_2, by = "col5") %>% 
  sql("SELECT t.* 
      FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM 
      FROM table_name t
      ) t 
      WHERE SEQNUM = 1;")

So I dplyr::left_join() two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so what would I use for the table_name on the line FROM table_name t?

In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun . or sometimes the .data pronoun from rlang if I were in memory working in R without databases.

I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.

Recommended answer

It looks like you are wanting to combine custom SQL code with auto-generated SQL code from dbplyr. For this it is important to distinguish between:


  • DBI::db* commands - that execute the provided SQL on the database and return the result.
  • dbplyr translation - where you work with a remote connection to a table
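
To make the distinction concrete, here is a minimal sketch. It assumes the RSQLite package is installed and uses an in-memory SQLite database as a stand-in for the real one; the table contents are made up for illustration.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for the real database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "table_name",
             data.frame(col1 = c(1, 1, 2), col2 = c("a", "a", "b")))

# DBI: the SQL string is executed immediately and a local data frame comes back.
local_df <- dbGetQuery(con, "SELECT DISTINCT * FROM table_name")

# dbplyr: tbl() gives a lazy reference to the remote table. The dplyr verbs are
# translated to SQL, and nothing is fetched until collect() is called.
remote_tbl <- tbl(con, "table_name") %>% distinct()
collected_df <- collect(remote_tbl)

dbDisconnect(con)
```

Both routes run a DISTINCT query on the database; the difference is whether you hand over a finished SQL string (DBI) or let dbplyr build the SQL from dplyr verbs.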

You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.

If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository (here). This includes:


  • union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
  • write_to_datebase function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute

dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which SQL translations are defined. As long as a translation exists for each verb, you can chain many pipes together (I have used 10 or more at once). Almost the only disadvantage is that the translated SQL becomes difficult for a human to read.

For example, consider the following:

library(dbplyr)
library(dplyr)

tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))

df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())

df = left_join(df1, df2, by = "col1") %>%
  distinct()

When you then call show_query(df) R returns the following auto-generated SQL code:

SELECT DISTINCT *
FROM (

SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)

) `dbplyr_002`

But not as nicely formatted. Note that the initial command (left join) appears as a nested query, with a distinct in the outer query. Hence df is an R link to a remote database table defined by the above sql query.
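
As a small sketch of inspecting the generated SQL programmatically, sql_render() returns the same query that show_query() prints, and it can be captured as a character string. The extra filter() step is added here only to illustrate a longer chain; the exact text of the generated SQL varies between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

tmp_df <- data.frame(col1 = c(1, 2, 3), col2 = c("a", "b", "c"))

# Simulated connections translate to SQL without needing a real database.
df1 <- tbl_lazy(tmp_df, con = simulate_postgres())
df2 <- tbl_lazy(tmp_df, con = simulate_postgres())

# Each verb is folded into the query built so far.
query <- left_join(df1, df2, by = "col1") %>%
  filter(col1 > 1) %>%
  distinct()

# Capture the generated SQL as a plain character string.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```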

You can pipe dbplyr into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.

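custom_distinct <- function(df){
  # Connection of the remote table; remote_con() is dbplyr's accessor for it
  db_connection <- remote_con(df)

  # Wrap the SQL that dbplyr generated for df inside the custom query
  sql_query <- build_sql(con = db_connection,
                         "SELECT DISTINCT * FROM (\n",
                         sql_render(df),
                         ") AS nested_tbl"
  )
  # Return a new remote table defined by the combined query
  return(tbl(db_connection, sql(sql_query)))
}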

df = left_join(df1, df2, by = "col1") %>%
  custom_distinct()

When you then call show_query(df), R should return the following SQL code, though not as nicely formatted (I say 'should' because I cannot get this working with simulated SQL connections):

SELECT DISTINCT * FROM (

SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)

) nested_tbl

As with the previous example, df is an R link to a remote database table defined by the above sql query.
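
The same wrapping approach can be applied to the ROW_NUMBER query from the question: the SQL rendered from the piped table takes the place of table_name. The following is a sketch under stated assumptions, not a definitive implementation. It assumes the RSQLite package (with a bundled SQLite recent enough to support window functions) as a stand-in database; first_row_per_group, t1, t2 and their columns are hypothetical names chosen for illustration.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite stands in for the real database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "t1", data.frame(col5 = c(1, 1, 2), x = c("a", "b", "c")))
dbWriteTable(con, "t2", data.frame(col5 = c(1, 2), y = c("p", "q")))

# Hypothetical helper: keep one row per distinct value of `column`.
first_row_per_group <- function(df, column) {
  db_connection <- remote_con(df)
  col <- as.character(DBI::dbQuoteIdentifier(db_connection, column))
  # The rendered dbplyr query replaces table_name. Note there is no trailing
  # semicolon, because dbplyr nests this query inside further SQL.
  query <- paste0(
    "SELECT t.* FROM (",
    "SELECT s.*, ROW_NUMBER() OVER (PARTITION BY ", col,
    " ORDER BY ", col, ") AS seqnum ",
    "FROM (", as.character(sql_render(df)), ") AS s",
    ") AS t WHERE seqnum = 1"
  )
  tbl(db_connection, sql(query))
}

result <- left_join(tbl(con, "t1"), tbl(con, "t2"), by = "col5") %>%
  first_row_per_group("col5") %>%
  collect()

dbDisconnect(con)
```

Because the helper returns a remote table, further dplyr verbs can be piped after it before calling collect().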

You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.

As another way of writing a distinct query:

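# df1 and df2 must be remote tables on a live DBI connection here,
# e.g. created with tbl(); a simulated connection can render SQL
# but cannot execute it.
df = left_join(df1, df2, by = "col1")

# Render the dbplyr query to a plain string and wrap it in custom SQL
custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
                          as.character(sql_render(df)),
                          ") AS nested_table")

# Use the connection the remote tables came from to run the query
db_connection = remote_con(df)
local_table = DBI::dbGetQuery(db_connection, custom_distinct2)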

This returns a local R data frame containing the result of the same SQL command as in the previous examples.
