在dplyr中用转换后的数据替换SQL数据库表 [英] Replacing SQL database table with transformed data in dplyr
问题描述
在dplyr中转换数据时,我经常替换数据,尤其是在处理大型数据集时.当我使用SQL支持的数据集时,至少在使用SQLite时,我不确定如何优雅地做到这一点.
I frequently replace data as I transform it in dplyr, especially when working with large datasets. I'm not sure how to do this elegantly when I am working with SQL-backed datasets, at least not with SQLite.
在dplyr数据库小插曲中或在SO上找不到关于此目标的任何讨论,这也让我想知道我最初所做的事情是否有问题.但是,这似乎是处理大型数据集的自然方法.
I could not find any discussion of this goal in the dplyr DB vignettes or on SO, which also makes me wonder if there's something wrong with what I'm doing in the first place; however, it seems to be a natural way to work with large datasets.
无论如何,最直观的方法行不通:
At any rate, the most intuitive approach does not work:
library(dplyr)
library(RSQLite)
db2 <- src_sqlite("trouble.sqlite", create = TRUE)
trouble <- data.frame(values = c(5, 1, 3))
trouble.db <- copy_to(db2, trouble, temporary = FALSE)
collect(trouble.db) # 5, 3, 1
trouble.db <- trouble.db %>% arrange(values)
collect(trouble.db) # 1, 3, 5
trouble.in <- tbl(db2, sql("SELECT * from trouble"))
collect(trouble.in) # 5, 3, 1
就地复制的另一种直观语法给出表已存在"错误:
Another intuitive syntax for in-place copy gives a "table already exists" error:
trouble.db <- copy_to(db2, as.data.frame(trouble.db), name="trouble", temporary = FALSE)
一种解决方案是手动删除表并重建它,这就是我一直在做的事情:
One solution is to manually drop the table and rebuild it, which is what I've been doing:
db2$con %>% db_drop_table(table = "trouble")
trouble <- collect(trouble.db)
trouble.db <- copy_to(db2, trouble, temporary = FALSE)
另一种方法是放弃替换并创建一系列临时表,我觉得这些表不美观,但我认为这可能是推荐的范例:
Another one is to give up on replacement and create a series of temporary tables, which I find unaesthetic but which I suppose might be the recommended paradigm:
trouble_temp <- data.frame(values = c(5, 1, 3))
trouble_temp.db <- copy_to(db2, trouble_temp, temporary = TRUE)
trouble <- trouble.db %>% arrange(values)
trouble.db <- copy_to(db2, trouble, temporary = FALSE)
我怀疑拖放并复制"将成为答案,但是出于对精美解决方案的热爱,我想问一下是否有更好的方法.
I suspect that "drop and re-copy" is going to wind up being the answer, but out of an abundance of love for beautiful solutions, I thought I would ask if there is a better way.
推荐答案
此年之后找到的任何人.
For anyone finding this years after.
声明
trouble.db %>% arrange(values)
创建一个SQL查询,该查询将发送到数据库并在您收集
结果时执行.
Creates an SQL query that's sent to the database and executed when you collect
the result.
我们可以看到这样的SQL
We can see that SQL like this
trouble.db %>% arrange(values) %>% show_query()
SELECT *
FROM `trouble`
ORDER BY `values`
很显然,这样的查询无法修改其查询的实际数据.
Clearly a query like this can't modify the actual data it's querying.
要修改数据,我们可以使用 DBI
包中的 dbWriteTable
函数
To modify the data, we can use dbWriteTable
function from the DBI
package
library(dplyr)
library(RSQLite)
# Will use this connection object for all our DB interactions
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
trouble <- data.frame(values = c(5, 1, 3))
trouble.db <- copy_to(con, trouble, temporary = FALSE)
collect(trouble.db) # 5, 3, 1
# This is just a query
trouble.db <- trouble.db %>% arrange(values)
collect(trouble.db) # 1, 3, 5
# The data shouldn't be modified yet
trouble.in <- tbl(con, sql("SELECT * from trouble"))
collect(trouble.in) # 5, 3, 1
# Now we are modifying the data
DBI::dbWriteTable(
con, "trouble", collect(trouble.db),
overwrite = TRUE
)
tbl(con, sql("SELECT * from trouble")) %>% collect() # 1, 3, 5
这篇关于在dplyr中用转换后的数据替换SQL数据库表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!