inserting multiple records at once into Redshift with R

Question
I need to push a few thousand rows into Redshift multiple times a day. But I can't use bulk insert from S3 for administrative reasons. What's the fastest way?
There are 3 ways (that I can see) to insert rows into a table in Amazon Redshift from R:

- Row-by-row insert: each row is inserted as its own `INSERT INTO ... VALUES` query.
- Multi-row insert: same as 1, but multiple records are inserted per query. The size is constrained either by the number of rows or by the 16 MB maximum SQL query size.
- Bulk insert from AWS S3, DynamoDB, or EMR.
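To make the difference between the first two approaches concrete, here is a small sketch in R of building both query forms from the same data (the table name, column names, and values are made up for illustration):

```r
# Three rows of already-formatted SQL literals (illustrative values).
rows <- list(c("1", "'a'"), c("2", "'b'"), c("3", "'c'"))

# Approach 1: one INSERT statement per row.
single_inserts <- vapply(rows, function(r) {
  sprintf("INSERT INTO my_table (id, val) VALUES (%s)",
          paste(r, collapse = ", "))
}, character(1))

# Approach 2: one INSERT statement carrying every row as its own tuple.
tuples <- vapply(rows, function(r) {
  sprintf("(%s)", paste(r, collapse = ", "))
}, character(1))
multi_insert <- sprintf("INSERT INTO my_table (id, val) VALUES %s",
                        paste(tuples, collapse = ", "))
```

With RODBC, approach 1 costs one round trip per row, while approach 2 sends all the rows in a single statement.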
Each approach above is roughly an order of magnitude faster than the one before it. I'd love to use bulk insert to create or update tables, but our organization has made the security and administrative decision not to allow business users to bulk load or unload data into S3. So I can't use the `redshiftTools` package to upload files.
I can do number 1 above using `RODBC::sqlSave`, which is slow, but gets the job done... eventually.
What I'd rather have is something like `sqlSave` that inserts multiple rows of data at once, without going over the row/size limits for Redshift. This is easy for simple data structures, but a generalized function that would handle integers, characters, dates, etc. would be ideal, as I'm not doing this with only one single table. So I popped open the source for `sqlSave` and started to roll my own function for building multi-row inserts that would chunk the data into 1000-row chunks and build and execute the query for each chunk.
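A minimal sketch of the kind of query builder I started on, with the caveat that the value formatting here handles only single quotes and a few base types, so it is illustrative rather than production-ready (function names are my own):

```r
# Format one R value as a SQL literal: NA -> NULL, numerics unquoted,
# everything else single-quoted with embedded quotes doubled.
sql_literal <- function(x) {
  if (is.na(x)) return("NULL")
  if (is.numeric(x)) return(as.character(x))
  sprintf("'%s'", gsub("'", "''", as.character(x)))
}

# Build one multi-row INSERT statement per chunk of `chunk_size` rows
# and return the statements as a character vector.
build_multi_row_inserts <- function(df, tablename, chunk_size = 1000) {
  cols <- paste(sprintf('"%s"', names(df)), collapse = ", ")
  starts <- seq(1, nrow(df), by = chunk_size)
  vapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, nrow(df))
    tuples <- vapply(idx, function(i) {
      # index columns directly so each keeps its own type
      vals <- vapply(seq_along(df), function(j) sql_literal(df[[j]][i]),
                     character(1))
      sprintf("(%s)", paste(vals, collapse = ", "))
    }, character(1))
    sprintf('INSERT INTO "%s" (%s) VALUES %s',
            tablename, cols, paste(tuples, collapse = ", "))
  }, character(1))
}
```

Each returned statement could then be executed against the open connection with `RODBC::sqlQuery(dbhandle, q)`.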
But I stopped to ask on here whether this has already been done. Is there a better way? I have a feeling that maybe one of the other SQL packages for R has a function to do something like this. However, when I search, all I find are other people having the same sort of problem.

Any tips?
Thanks to some prompting, I investigated the `fast=TRUE` switch in `RODBC::sqlSave`. The documentation makes it sound like this is what I'm after:
fast: logical. If false, write data a row at a time. If true, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation.
So I figured I should test this out. I created a little data frame with 10 records and 2 columns:
df <- data.frame(a=rnorm(10), b=sample(letters, 10, replace=TRUE),
stringsAsFactors = FALSE)
Then I used `benchmark` to time the execution across 5 replications:
benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE), replications = 5)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = TRUE) 5 512.59 1 0.08 0.03 NA NA
benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE), replications = 5)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = FALSE) 5 264.37 1 0.08 0.02 NA NA
That's a little hard to read, but in summary:

- `fast=TRUE` took 512 seconds
- `fast=FALSE` took 264 seconds
With 25 records, the times go up to:

- `fast=TRUE` took 1208 seconds
- `fast=FALSE` took 604 seconds
That makes no sense to me at all.

I tried the `test=TRUE` switch thinking it would show me what's going on, but I can't figure out what that does at all... However, turning on `verbose=TRUE` helped me realize that `fast=TRUE` does not do what I thought it did. It seems to use parameter substitution, but it does not do one big insert. It still issues `nrow(df)` worth of inserts:
> df <- data.frame(a=rnorm(5), b=sample(letters, 5, replace=TRUE), stringsAsFactors = FALSE)
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '1', -1.45261402, 'd' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '2', -0.01642518, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '3', 1.11767938, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '4', -0.63480166, 'a' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '5', -0.75538702, 'k' )
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( ?,?,? )
Binding: 'rownames' DataType -9, ColSize 255
Binding: 'a' DataType 6, ColSize 17
Binding: 'b' DataType -9, ColSize 255
Parameters:
no: 1: rownames 1/***/no: 2: a -1.45261/***/no: 3: b d/***/
no: 1: rownames 2/***/no: 2: a -0.0164252/***/no: 3: b m/***/
no: 1: rownames 3/***/no: 2: a 1.11768/***/no: 3: b m/***/
no: 1: rownames 4/***/no: 2: a -0.634802/***/no: 3: b a/***/
no: 1: rownames 5/***/no: 2: a -0.755387/***/no: 3: b k/***/
Answer
I was ultimately unable to find an implementation of a SQL write function that would do chunking in R. But I did see that in Python the `sqlalchemy` package teamed with `pandas` could easily do this. So I whipped out reticulate and wrapped Python in some R code to create a function to write to Redshift. It seems like overkill, but it gets the job done without me having to reimplement anything:
start_python <- function(){
  library(reticulate)
  use_condaenv("r-reticulate")
  # assign into the global environment so write_to_redshift can see them
  pd <<- import('pandas')
  sa <<- import('sqlalchemy')
}

# write a table to RDW sandbox
write_to_redshift <- function(df, tablename, if_exists = 'append'){
  pd_df <- r_to_py(df)
  eng <- sa$create_engine('postgres://user:pwd@redshift_name:5439/db_name')
  conn <- eng$connect()
  write_result <- pd_df$to_sql(name = tablename, con = conn, index = FALSE,
                               if_exists = if_exists, schema = 'my_schema',
                               chunksize = 10000L)
  conn$close()
  return(write_result)
}