直接使用dplyr突变数据库表中的变量 [英] Mutate variables in database tables directly using dplyr

查看:84
本文介绍了直接使用dplyr突变数据库表中的变量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是MonetDBLite数据库文件中的mtcars数据。

Here is mtcars data in the MonetDBLite database file.

library(MonetDBLite)
library(tidyverse)
library(DBI)

dbdir <- getwd()
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)

dbWriteTable(conn = con, name = "mtcars_1", value = mtcars)

data_mt <- con %>% tbl("mtcars_1")

我想使用dplyr mutate创建新变量并将其添加(提交!)到数据库表中吗?

I want to use dplyr mutate to create new variables and add (commit!) that to the database table? Something like

data_mt %>% select(mpg, cyl) %>% mutate(var = mpg/cyl) %>% dbCommit(con)

当我们这样做时,所需的输出应该是相同的:

The desired output should be same when we do:

dbSendQuery(con, "ALTER TABLE mtcars_1 ADD COLUMN var DOUBLE PRECISION")
dbSendQuery(con, "UPDATE mtcars_1 SET var=mpg/cyl") 

怎么办?

推荐答案

这里有几个功能, create update.tbl_lazy

Here's a couple of functions, create and update.tbl_lazy.

它们分别实现了 CREATE TABLE ALTER TABLE / UPDATE 对,它要少得多:

They respectively implement CREATE TABLE, which was straightforward, and the ALTER TABLE/UPDATE pair which is much less so:

创建

create <- function(data,name){
  DBI::dbSendQuery(data$src$con,
                   paste("CREATE TABLE", name,"AS", dbplyr::sql_render(data)))
  dplyr::tbl(data$src$con,name)
}

示例:

library(dbplyr)
library(DBI)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
copy_to(con, head(iris,3),"iris")

tbl(con,"iris") %>% mutate(Sepal.Area= Sepal.Length * Sepal.Width) %>% create("iris_2")

# # Source:   table<iris_2> [?? x 6]
# # Database: sqlite 3.22.0 []
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
#          <dbl>       <dbl>        <dbl>       <dbl> <chr>        <dbl>
# 1          5.1         3.5          1.4         0.2 setosa        17.8
# 2          4.9         3            1.4         0.2 setosa        14.7
# 3          4.7         3.2          1.3         0.2 setosa        15.0

更新

update.tbl_lazy <- function(.data,...,new_type="DOUBLE PRECISION"){
  quos <- rlang::quos(...)
  dots <- rlang::exprs_auto_name(quos, printer = tidy_text)

  # extract key parameters from query
  sql <- dbplyr::sql_render(.data)
  con  <- .data$src$con
  table_name <-gsub(".*?(FROM (`|\")(.+?)(`|\")).*","\\3",sql)
  if(grepl("\nWHERE ",sql)) where <-  regmatches(sql, regexpr("WHERE .*",sql))
  else where <- ""
  new_cols <- setdiff(names(dots),colnames(.data))

  # Add empty columns to base table
  if(length(new_cols)){
    alter_queries <- paste("ALTER TABLE",table_name,"ADD COLUMN",new_cols,new_type)
    purrr::walk(alter_queries, ~{
      rs <- DBI::dbSendStatement(con, .)
      DBI::dbClearResult(rs)})}

  # translate unevaluated dot arguments to SQL instructions as character
  translations  <- purrr::map_chr(dots, ~ translate_sql(!!! .))
  # messy hack to make translations work
  translations <- gsub("OVER \\(\\)","",translations) 

  # 2 possibilities: called group_by or (called filter or called nothing)
  if(identical(.data$ops$name,"group_by")){
    # ERROR if `filter` and `group_by` both used
    if(where != "") stop("Using both `filter` and `group by` is not supported")

    # Build aggregated table
    gb_cols   <- paste0('"',.data$ops$dots,'"',collapse=", ")
    gb_query0 <- paste(translations,"AS", names(dots),collapse=", ")
    gb_query  <- paste("CREATE TABLE TEMP_GB_TABLE AS SELECT",
                       gb_cols,", ",gb_query0,
                       "FROM", table_name,"GROUP BY", gb_cols)
    rs <- DBI::dbSendStatement(con, gb_query)
    DBI::dbClearResult(rs)

    # Delete temp table on exit
    on.exit({
      rs <- DBI::dbSendStatement(con,"DROP TABLE TEMP_GB_TABLE")
      DBI::dbClearResult(rs)
    })

    # Build update query
    gb_on <- paste0(table_name,'."',.data$ops$dots,'" = TEMP_GB_TABLE."', .data$ops$dots,'"',collapse=" AND ")
    update_query0 <- paste0(names(dots)," = (SELECT ", names(dots), " FROM TEMP_GB_TABLE WHERE ",gb_on,")",
                            collapse=", ")
    update_query <- paste("UPDATE", table_name, "SET", update_query0)
    rs <- DBI::dbSendStatement(con, update_query)
    DBI::dbClearResult(rs)

  } else {

    # Build update query in case of no group_by and optional where
    update_query0 <- paste(names(dots),'=',translations,collapse=", ")
    update_query  <- paste("UPDATE", table_name,"SET", update_query0,where)
    rs <- DBI::dbSendStatement(con, update_query)
    DBI::dbClearResult(rs)
  }
  tbl(con,table_name)
}

示例1 ,定义2个新数字列:

example 1, define 2 new numeric columns :

tbl(con,"iris") %>% update(x=pmax(Sepal.Length,Sepal.Width),
                           y=pmin(Sepal.Length,Sepal.Width))

# # Source:   table<iris> [?? x 7]
# # Database: sqlite 3.22.0 []
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x     y
#          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <dbl> <dbl>
# 1          5.1         3.5          1.4         0.2 setosa    5.1   3.5
# 2          4.9         3            1.4         0.2 setosa    4.9   3  
# 3          4.7         3.2          1.3         0.2 setosa    4.7   3.2

示例2 ,修改现有列,创建2个不同类型的新列:

example 2, modify an existing column, create 2 new columns of different types :

tbl(con,"iris") %>%
  update(x= Sepal.Length*Sepal.Width,
         z= 2*y,
         a= Species %||% Species,               
         new_type = c("DOUBLE","VARCHAR(255)"))

# # Source:   table<iris> [?? x 9]
# # Database: sqlite 3.22.0 []
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x     y     z a           
#          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <dbl> <dbl> <dbl> <chr>       
# 1          5.1         3.5          1.4         0.2 setosa   17.8   3.5   7   setosasetosa
# 2          4.9         3            1.4         0.2 setosa   14.7   3     6   setosasetosa
# 3          4.7         3.2          1.3         0.2 setosa   15.0   3.2   6.4 setosasetosa

示例3 ,更新位置:

tbl(con,"iris") %>% filter(Sepal.Width > 3) %>% update(a="foo")

# # Source:   table<iris> [?? x 9]
# # Database: sqlite 3.22.0 []
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x     y     z a           
#          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <dbl> <dbl> <dbl> <chr>       
# 1          5.1         3.5          1.4         0.2 setosa   17.8   3.5   7   foo         
# 2          4.9         3            1.4         0.2 setosa   14.7   3     6   setosasetosa
# 3          4.7         3.2          1.3         0.2 setosa   15.0   3.2   6.4 foo

示例4 :按组更新

tbl(con,"iris") %>%
  group_by(Species, Petal.Width) %>%
  update(new_col1 = sum(Sepal.Width,na.rm=TRUE), # using a R function
         new_col2 = MAX(Sepal.Length))           # using native SQL

# # Source:   SQL [?? x 11]
# # Database: sqlite 3.22.0 []
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species        x     y     z a            new_col1 new_col2
#          <dbl>       <dbl>        <dbl>       <dbl> <chr>      <dbl> <dbl> <dbl> <chr>           <dbl>    <dbl>
# 1          5.1         3.5          1.4         0.2 setosa         1     2   7   foo               6.5      5.1
# 2          4.9         3            1.4         0.2 setosa         1     2   6   setosasetosa      6.5      5.1
# 3          7           3.2          4.7         1.4 versicolor     1     2   6.4 foo               3.2      7 

一般注释


  • 代码使用了 dbplyr :: translate_sql ,因此我们可以像过去一样使用R函数或本机函数变异调用。

  • The code uses uses dbplyr::translate_sql so we can use R functions or native ones alike just like in good old mutate calls.

更新在一个过滤器调用或一个 group_by 调用或每个零的任何其他操作之后使用,您将得到一个错误或意外结果。

update can only be used after one filter call OR one group_by call OR zero of each, anything else and you'll get an error or unexpected results.

group_by 实现非常hacky,因此没有空间动态定义列或按操作,要坚持基础。

The group_by implementation is VERY hacky, so no room for defining columns on the fly or grouping by an operation, stick to the basics.

update create 均返回 tbl(con,table_name),这意味着您可以链接多个 create update 随心所欲地调用,并在两者之间使用适当量的 group_by filter 。实际上,我所有的4个示例都可以链接。

update and create both return tbl(con, table_name), which means you can chain as many create or update calls as you wish, with the appropriate amount of group_by and filter in between. In fact all of my 4 examples can be chained.

要敲钉子, create 不会不受相同的限制,在调用它之前,您可以根据需要获得尽可能多的 dbplyr 乐趣。

To hammer the nail, create doesn't suffer from the same restrictions, you can have as much dbplyr fun as desired before calling it.

我没有实现类型检测,因此我需要 new_type 参数,该参数在 paste 调用中被回收代码中的 alter_queries 定义,因此它可以是单个值或向量。

I didn't implement type detection, so I needed the new_type parameter, it is recycled in the paste call of the alter_queries definition in my code so it can be a single value or a vector.

解决后者的一种方法是从 translations 变量中提取变量,然后在 dbGetQuery(con , PRAGMA table_info(iris))。然后,我们需要所有现有类型之间的强制规则,然后进行设置。但是由于不同的DBMS具有不同的类型,所以我想不出通用的方法,而且我也不知道 MonetDBLite

One way to solve the latter would be to extract the variables from the translations variable, find their types in dbGetQuery(con,"PRAGMA table_info(iris)"). Then we need coercion rules between all existing types, and we're set. But as different DBMS have different types I can't think of a general way to do it, and I don't know MonetDBLite.

这篇关于直接使用dplyr突变数据库表中的变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆