Inserting multiple records at once into Redshift with R


Problem description

I need to push a few thousand rows into Redshift multiple times a day. But I can't use bulk insert from S3 for administrative reasons. What's the fastest way?

There are 3 ways (that I can see) to insert rows into a table in Amazon Redshift from R:

  1. Row by row insert query. Each row is inserted as its own INSERT VALUES query
  2. Multiple row insert query: same as 1, but multiple records are inserted per query. The size is constrained either by the number of rows or by the 16MB max SQL query size limit (a minimal sketch of this follows the list).
  3. Bulk insert from AWS S3, Dynamo, or EMR.
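
For reference, here is a minimal sketch of what approach 2 looks like: one INSERT statement carrying several VALUES tuples instead of one statement per row. The table, columns, and values below are made up, and dbhandle is assumed to be an open RODBC connection.

sql <- paste0(
  "INSERT INTO my_table (a, b) VALUES ",
  "(1.2, 'x'), (3.4, 'y'), (5.6, 'z')"
)
RODBC::sqlQuery(dbhandle, sql)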

Each approach above is ~ an order of magnitude faster than the prior one. I'd love to use bulk insert to create or update tables, but our organization has made the security and administrative decision not to allow business users to bulk load or unload data into S3. So that makes me unable to use the package redshiftTools to upload files.

I can do number 1 above using RODBC::sqlSave, which is slow, but gets the job done... eventually.

What I'd rather have is something like sqlSave that inserts multiple rows of data at once, but not so many as to go over the row/size limits for Redshift. This is easy for simple data structures, but a generalized function that would handle integers, characters, dates, etc. would be ideal, as I'm not doing this with only a single table. So I popped open the source for sqlSave and started to roll my own function for building multiple-row inserts that would chunk the data into 1000-row chunks and build and execute the query for each chunk.
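
To make the idea concrete, here is a rough sketch of such a chunked multi-row insert builder. It is illustrative only, not tested production code: it assumes an open RODBC connection dbhandle, columns limited to numeric, character, factor, and Date types, and it handles quoting only by doubling single quotes and mapping NA to SQL NULL.

multi_row_insert <- function(dbhandle, df, tablename, chunk_size = 1000) {
  # quote character/factor/Date values, leave numerics bare, map NA to SQL NULL
  quote_value <- function(x) {
    if (is.na(x)) return("NULL")
    if (is.numeric(x)) return(as.character(x))
    paste0("'", gsub("'", "''", as.character(x)), "'")
  }
  cols <- paste(names(df), collapse = ", ")
  # split the row indices into chunks of at most chunk_size rows
  chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))
  for (idx in chunks) {
    chunk <- df[idx, , drop = FALSE]
    # build one "(v1, v2, ...)" tuple per row
    tuples <- vapply(seq_len(nrow(chunk)), function(i) {
      vals <- vapply(seq_along(chunk), function(j) quote_value(chunk[[j]][i]),
                     character(1))
      paste0("(", paste(vals, collapse = ", "), ")")
    }, character(1))
    sql <- paste0("INSERT INTO ", tablename, " (", cols, ") VALUES ",
                  paste(tuples, collapse = ", "))
    RODBC::sqlQuery(dbhandle, sql)
  }
  invisible(nrow(df))
}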

But I stopped to ask here whether this has already been done. Is there a better way? I have a feeling that one of the other SQL packages for R may already have a function to do something like this. However, when I search, all I find are other people having the same sort of problem.

Any hints?

Thanks to some prompting, I investigated the fast=TRUE switch in RODBC::sqlSave. The documentation makes it sound like this is what I'm after:

fast: logical. If false, write data a row at a time. If true, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation.

So I figured I should test this out. So I created a little data frame with 10 records and 2 columns:

df <- data.frame(a=rnorm(10), b=sample(letters, 10, replace=TRUE),
                 stringsAsFactors = FALSE)    

Then I used benchmark to time the execution across 5 replications:

library(rbenchmark)  # benchmark() comes from the rbenchmark package
benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE), replications = 5)
#                                                                         test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = TRUE)            5  512.59        1      0.08     0.03         NA        NA

benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE), replications = 5)
#                                                                          test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = FALSE)            5  264.37        1      0.08     0.02         NA        NA
      

That's a little hard to read but, in summary:

  • fast=TRUE took 512 seconds
  • fast=FALSE took 264 seconds

With 25 records, the time goes up to:

  • fast=TRUE took 1208 seconds
  • fast=FALSE took 604 seconds

That makes zero sense to me.

I tried the test=TRUE switch, thinking it would show me what's going on, but I can't figure out what that does at all... However, turning on verbose=TRUE helped me realize that fast=TRUE does not do what I thought it did. It seems to use parameter substitution, but it does not do one big insert. It still does nrow(df) worth of inserts:

> df <- data.frame(a=rnorm(5), b=sample(letters, 5, replace=TRUE), stringsAsFactors = FALSE)
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '1', -1.45261402, 'd' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '2', -0.01642518, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '3',  1.11767938, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '4', -0.63480166, 'a' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '5', -0.75538702, 'k' )
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( ?,?,? )
Binding: 'rownames' DataType -9, ColSize 255
Binding: 'a' DataType 6, ColSize 17
Binding: 'b' DataType -9, ColSize 255
Parameters:
no: 1: rownames 1/***/no: 2: a -1.45261/***/no: 3: b d/***/
no: 1: rownames 2/***/no: 2: a -0.0164252/***/no: 3: b m/***/
no: 1: rownames 3/***/no: 2: a 1.11768/***/no: 3: b m/***/
no: 1: rownames 4/***/no: 2: a -0.634802/***/no: 3: b a/***/
no: 1: rownames 5/***/no: 2: a -0.755387/***/no: 3: b k/***/

Solution

I was ultimately unable to find an implementation of an SQL write function that would do chunking in R. But I did see that in Python the sqlalchemy package teamed with pandas could easily do this. So I whipped out Reticulate and wrapped Python in some R code to create a function to write to redshift. Seems like overkill, but it gets the job done without me having to reimplement anything:

# set up reticulate and import the Python packages; the <<- assignments put
# pd and sa in the global environment so write_to_redshift() can see them
start_python <- function(){
  library(reticulate)
  use_condaenv("r-reticulate")
  pd <<- import('pandas')
  sa <<- import('sqlalchemy')
}

# write a table to RDW sandbox
write_to_redshift <- function(df, tablename, if_exists = 'append'){
  pd_df <- r_to_py(df)   # convert the R data frame to a pandas DataFrame
  eng <- sa$create_engine('postgres://user:pwd@redshift_name:5439/db_name')
  conn <- eng$connect()
  # pandas handles chunking the write into 10,000-row batches
  write_result <- pd_df$to_sql(name = tablename, con = conn, index = FALSE,
                               if_exists = if_exists, schema = 'my_schema',
                               chunksize = 10000L)
  conn$close()
  return(write_result)
}
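
Hypothetical usage, assuming the placeholder connection string above has been filled in with real credentials and df is an ordinary R data frame:

start_python()
write_to_redshift(df, tablename = 'jal_test1')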
