直接使用dplyr突变数据库表中的变量 [英] Mutate variables in database tables directly using dplyr
问题描述
这是MonetDBLite数据库文件中的mtcars数据。
Here is mtcars data in the MonetDBLite database file.
library(MonetDBLite)
library(tidyverse)
library(DBI)
dbdir <- getwd()
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
dbWriteTable(conn = con, name = "mtcars_1", value = mtcars)
data_mt <- con %>% tbl("mtcars_1")
我想使用dplyr mutate创建新变量并将其添加(提交!)到数据库表中吗?
I want to use dplyr mutate to create new variables and add (commit!) that to the database table? Something like
data_mt %>% select(mpg, cyl) %>% mutate(var = mpg/cyl) %>% dbCommit(con)
当我们这样做时,所需的输出应该是相同的:
The desired output should be same when we do:
dbSendQuery(con, "ALTER TABLE mtcars_1 ADD COLUMN var DOUBLE PRECISION")
dbSendQuery(con, "UPDATE mtcars_1 SET var=mpg/cyl")
怎么办?
推荐答案
这里有几个功能, create
和 update.tbl_lazy
。
Here's a couple of functions, create
and update.tbl_lazy
.
它们分别实现了 CREATE TABLE
和 ALTER TABLE
/ UPDATE
对,它要少得多:
They respectively implement CREATE TABLE
, which was straightforward, and the ALTER TABLE
/UPDATE
pair which is much less so:
创建
create <- function(data,name){
DBI::dbSendQuery(data$src$con,
paste("CREATE TABLE", name,"AS", dbplyr::sql_render(data)))
dplyr::tbl(data$src$con,name)
}
示例:
library(dbplyr)
library(DBI)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
copy_to(con, head(iris,3),"iris")
tbl(con,"iris") %>% mutate(Sepal.Area= Sepal.Length * Sepal.Width) %>% create("iris_2")
# # Source: table<iris_2> [?? x 6]
# # Database: sqlite 3.22.0 []
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 17.8
# 2 4.9 3 1.4 0.2 setosa 14.7
# 3 4.7 3.2 1.3 0.2 setosa 15.0
更新
update.tbl_lazy <- function(.data,...,new_type="DOUBLE PRECISION"){
quos <- rlang::quos(...)
dots <- rlang::exprs_auto_name(quos, printer = tidy_text)
# extract key parameters from query
sql <- dbplyr::sql_render(.data)
con <- .data$src$con
table_name <-gsub(".*?(FROM (`|\")(.+?)(`|\")).*","\\3",sql)
if(grepl("\nWHERE ",sql)) where <- regmatches(sql, regexpr("WHERE .*",sql))
else where <- ""
new_cols <- setdiff(names(dots),colnames(.data))
# Add empty columns to base table
if(length(new_cols)){
alter_queries <- paste("ALTER TABLE",table_name,"ADD COLUMN",new_cols,new_type)
purrr::walk(alter_queries, ~{
rs <- DBI::dbSendStatement(con, .)
DBI::dbClearResult(rs)})}
# translate unevaluated dot arguments to SQL instructions as character
translations <- purrr::map_chr(dots, ~ translate_sql(!!! .))
# messy hack to make translations work
translations <- gsub("OVER \\(\\)","",translations)
# 2 possibilities: called group_by or (called filter or called nothing)
if(identical(.data$ops$name,"group_by")){
# ERROR if `filter` and `group_by` both used
if(where != "") stop("Using both `filter` and `group by` is not supported")
# Build aggregated table
gb_cols <- paste0('"',.data$ops$dots,'"',collapse=", ")
gb_query0 <- paste(translations,"AS", names(dots),collapse=", ")
gb_query <- paste("CREATE TABLE TEMP_GB_TABLE AS SELECT",
gb_cols,", ",gb_query0,
"FROM", table_name,"GROUP BY", gb_cols)
rs <- DBI::dbSendStatement(con, gb_query)
DBI::dbClearResult(rs)
# Delete temp table on exit
on.exit({
rs <- DBI::dbSendStatement(con,"DROP TABLE TEMP_GB_TABLE")
DBI::dbClearResult(rs)
})
# Build update query
gb_on <- paste0(table_name,'."',.data$ops$dots,'" = TEMP_GB_TABLE."', .data$ops$dots,'"',collapse=" AND ")
update_query0 <- paste0(names(dots)," = (SELECT ", names(dots), " FROM TEMP_GB_TABLE WHERE ",gb_on,")",
collapse=", ")
update_query <- paste("UPDATE", table_name, "SET", update_query0)
rs <- DBI::dbSendStatement(con, update_query)
DBI::dbClearResult(rs)
} else {
# Build update query in case of no group_by and optional where
update_query0 <- paste(names(dots),'=',translations,collapse=", ")
update_query <- paste("UPDATE", table_name,"SET", update_query0,where)
rs <- DBI::dbSendStatement(con, update_query)
DBI::dbClearResult(rs)
}
tbl(con,table_name)
}
示例1 ,定义2个新数字列:
example 1, define 2 new numeric columns :
tbl(con,"iris") %>% update(x=pmax(Sepal.Length,Sepal.Width),
y=pmin(Sepal.Length,Sepal.Width))
# # Source: table<iris> [?? x 7]
# # Database: sqlite 3.22.0 []
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species x y
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5
# 2 4.9 3 1.4 0.2 setosa 4.9 3
# 3 4.7 3.2 1.3 0.2 setosa 4.7 3.2
示例2 ,修改现有列,创建2个不同类型的新列:
example 2, modify an existing column, create 2 new columns of different types :
tbl(con,"iris") %>%
update(x= Sepal.Length*Sepal.Width,
z= 2*y,
a= Species %||% Species,
new_type = c("DOUBLE","VARCHAR(255)"))
# # Source: table<iris> [?? x 9]
# # Database: sqlite 3.22.0 []
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species x y z a
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 17.8 3.5 7 setosasetosa
# 2 4.9 3 1.4 0.2 setosa 14.7 3 6 setosasetosa
# 3 4.7 3.2 1.3 0.2 setosa 15.0 3.2 6.4 setosasetosa
示例3 ,更新位置:
tbl(con,"iris") %>% filter(Sepal.Width > 3) %>% update(a="foo")
# # Source: table<iris> [?? x 9]
# # Database: sqlite 3.22.0 []
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species x y z a
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 17.8 3.5 7 foo
# 2 4.9 3 1.4 0.2 setosa 14.7 3 6 setosasetosa
# 3 4.7 3.2 1.3 0.2 setosa 15.0 3.2 6.4 foo
示例4 :按组更新
tbl(con,"iris") %>%
group_by(Species, Petal.Width) %>%
update(new_col1 = sum(Sepal.Width,na.rm=TRUE), # using a R function
new_col2 = MAX(Sepal.Length)) # using native SQL
# # Source: SQL [?? x 11]
# # Database: sqlite 3.22.0 []
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species x y z a new_col1 new_col2
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 1 2 7 foo 6.5 5.1
# 2 4.9 3 1.4 0.2 setosa 1 2 6 setosasetosa 6.5 5.1
# 3 7 3.2 4.7 1.4 versicolor 1 2 6.4 foo 3.2 7
一般注释
-
代码使用了
dbplyr :: translate_sql
,因此我们可以像过去一样使用R函数或本机函数变异
调用。
The code uses uses
dbplyr::translate_sql
so we can use R functions or native ones alike just like in good oldmutate
calls.
更新
在一个过滤器
调用或一个 group_by
调用或每个零的任何其他操作之后使用,您将得到一个错误或意外结果。
update
can only be used after one filter
call OR one group_by
call OR zero of each, anything else and you'll get an error or unexpected results.
group_by
实现非常hacky,因此没有空间动态定义列或按操作,要坚持基础。
The group_by
implementation is VERY hacky, so no room for defining columns on the fly or grouping by an operation, stick to the basics.
update
和 create
均返回 tbl(con,table_name)
,这意味着您可以链接多个 create
或 update
随心所欲地调用,并在两者之间使用适当量的 group_by
和 filter
。实际上,我所有的4个示例都可以链接。
update
and create
both return tbl(con, table_name)
, which means you can chain as many create
or update
calls as you wish, with the appropriate amount of group_by
and filter
in between. In fact all of my 4 examples can be chained.
要敲钉子, create
不会不受相同的限制,在调用它之前,您可以根据需要获得尽可能多的 dbplyr
乐趣。
To hammer the nail, create
doesn't suffer from the same restrictions, you can have as much dbplyr
fun as desired before calling it.
我没有实现类型检测,因此我需要 new_type
参数,该参数在 paste
调用中被回收代码中的 alter_queries
定义,因此它可以是单个值或向量。
I didn't implement type detection, so I needed the new_type
parameter, it is recycled in the paste
call of the alter_queries
definition in my code so it can be a single value or a vector.
解决后者的一种方法是从 translations
变量中提取变量,然后在 dbGetQuery(con , PRAGMA table_info(iris))
。然后,我们需要所有现有类型之间的强制规则,然后进行设置。但是由于不同的DBMS具有不同的类型,所以我想不出通用的方法,而且我也不知道 MonetDBLite
。
One way to solve the latter would be to extract the variables from the translations
variable, find their types in dbGetQuery(con,"PRAGMA table_info(iris)")
. Then we need coercion rules between all existing types, and we're set. But as different DBMS have different types I can't think of a general way to do it, and I don't know MonetDBLite
.
这篇关于直接使用dplyr突变数据库表中的变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!