RPostgreSQL - R Connection to Amazon Redshift - How to WRITE/Post Bigger Data Sets


Question


I'm experimenting with how to connect R with Amazon's Redshift - and publishing a short blog for other newbies.

Some good progress - I'm able to do most things (create tables, select data, and even sqlSave or dbSendQuery 'line by line'). HOWEVER, I have not found a way to do a BULK UPLOAD of a table in one shot (e.g. copy the whole 5X150 IRIS table/data frame to Redshift) that doesn't take more than a minute.

Question: Any advice for someone new to RPostgreSQL on how to write/upload a block of data to Redshift would be greatly appreciated!

RODBC:

colnames(iris) <- tolower(colnames(iris)) 
sqlSave(channel,iris,"iris", rownames=F) 

SLOOOOOOW! SO SLOW! There must be a better way - 150 rows takes ~1.5 minutes.

iris_results <- sqlQuery(channel,"select * from iris where species = 'virginica'") # fast subset. this does work and shows up on AWS Redshift Dashboard

sqlDrop(channel, "iris", errors = FALSE) # clean up our toys

RPostgreSQL

dbSendQuery(con, "create table iris_200 (sepallength float,sepalwidth float,petallength float,petalwidth float,species VARCHAR(100));")
dbListFields(con,"iris_200")

ONE BY ONE, insert rows into the table:

dbSendQuery(con, "insert into iris_200 values(5.1,3.5,1.4,0.2,'Iris-setosa');")

dbSendQuery(con, "insert into iris_200 values(5.5,2.5,1.1,0.4,'Iris-setosa');")

dbSendQuery(con, "insert into iris_200 values(5.2,3.3,1.2,0.3,'Iris-setosa');")

dframe <-dbReadTable(con,"iris_200") # ok

dbRemoveTable(con,"iris_200")  # and clean up toys

Or loop through the table (takes about 1 row per second):

for (i in 1:nrow(iris_200)) {
  query <- paste("insert into iris_200 values(",
                 iris_200[i, 1], ",", iris_200[i, 2], ",",
                 iris_200[i, 3], ",", iris_200[i, 4], ",",
                 "'", iris_200[i, 5], "'", ");", sep = "")

  print(paste("row", i, "loading data >>  ", query))

  dbSendQuery(con, query)
}
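
One incremental improvement over the loop, though still nowhere near a real bulk load, is to batch many rows into a single multi-value INSERT so each round trip carries a chunk of rows instead of one. A rough sketch along those lines, reusing iris_200 and con from above (the chunk size of 500 is an arbitrary choice):

chunk_size <- 500                      # arbitrary batch size; Redshift accepts multi-row VALUES lists
n <- nrow(iris_200)

for (s in seq(1, n, by = chunk_size)) {
  rows <- s:min(s + chunk_size - 1, n)
  # one "(5.1,3.5,1.4,0.2,'Iris-setosa')" tuple per row in the chunk
  values <- sprintf("(%g,%g,%g,%g,'%s')",
                    iris_200[rows, 1], iris_200[rows, 2],
                    iris_200[rows, 3], iris_200[rows, 4],
                    as.character(iris_200[rows, 5]))
  query <- paste0("insert into iris_200 values ",
                  paste(values, collapse = ","), ";")
  dbSendQuery(con, query)
}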

So briefly, this is the hacky/slow way - any advice on how to upload/insert bulk data appreciated - thanks!!

Full code here:

PS - got this error message: LOAD source is not supported. (Hint: only S3 or DynamoDB or EMR based load is allowed)


Update 6/12/2015 - A direct load of bulk data at reasonable speed may not be possible, per the error message above and the LOADING DATA section of this blog: http://dailytechnology.net/2013/08/03/redshift-what-you-need-to-know/

It notes:

So now that we've created our data structure, how do we get data into it? You have two choices: 1) Amazon S3, or 2) Amazon DynamoDB. Yes, you could simply run a series of INSERT statements, but that is going to be painfully slow. (!)

Amazon recommends using the S3 method, which I will describe briefly. I don’t see the DynamoDB as particularly useful unless you’re already using that and want to migrate some of your data to Redshift.

To get the data from your local network to S3.....

RA: Will post updates if I figure this out

Solution

It may be too late for the OP, but I'll post this here for future reference in case someone finds the same issue:

The steps to do a bulk insert are as follows (a rough hand-rolled sketch appears after the list):

  • Create a table in Redshift with the same structure as my data frame
  • Split the data into N parts
  • Convert the parts into a format readable by Redshift
  • Upload all the parts to Amazon S3
  • Run the COPY statement on Redshift
  • Delete the temporary files on Amazon S3
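
For reference, those steps can also be scripted by hand without a package. A rough sketch under some stated assumptions: my_data is the data frame to load, mytable and mybucket are placeholder names, the column types in the CREATE TABLE are purely illustrative, AWS credentials live in the usual environment variables, and the split-into-parts step is skipped (a single file still loads, it just runs on only one Redshift slice, so it is slower for big data):

library(aws.s3)        # S3 upload/delete; picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
library(RPostgreSQL)   # `con` below is an open Redshift connection, as in the question

# 1. Create a table matching the data frame (column names/types are illustrative)
dbSendQuery(con, "create table mytable (id integer, date date, value float);")

# 2. Write the data frame to a local CSV (the "Redshift-readable format" step)
tmp <- tempfile(fileext = ".csv")
write.csv(my_data, tmp, row.names = FALSE)

# 3. Upload the file to S3 (bucket and key names are hypothetical)
put_object(file = tmp, object = "staging/my_data.csv", bucket = "mybucket")

# 4. COPY from S3 into the table (legacy CREDENTIALS syntax; an IAM role also works)
copy_sql <- paste0(
  "copy mytable from 's3://mybucket/staging/my_data.csv' ",
  "credentials 'aws_access_key_id=", Sys.getenv("AWS_ACCESS_KEY_ID"),
  ";aws_secret_access_key=", Sys.getenv("AWS_SECRET_ACCESS_KEY"), "' ",
  "csv ignoreheader 1;")
dbSendQuery(con, copy_sql)

# 5. Remove the temporary S3 object and the local file
delete_object(object = "staging/my_data.csv", bucket = "mybucket")
unlink(tmp)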

I’ve created an R Package which does exactly this, except for the first step, and it’s called redshiftTools: https://github.com/sicarul/redshiftTools

To install the package, you’ll need to do:

install.packages('devtools')
devtools::install_github("RcppCore/Rcpp")
devtools::install_github("rstats-db/DBI")
devtools::install_github("rstats-db/RPostgres")
devtools::install_github("hadley/xml2")
install.packages("aws.s3", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
devtools::install_github("sicarul/redshiftTools")

Afterwards, you’ll be able to use it like this:

library("aws.s3")
library(RPostgres)
library(redshiftTools)

con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
host='my-redshift-url.amazon.com', port='5439',
user='myuser', password='mypassword',sslmode='require')

rs_replace_table(my_data, dbcon=con, tableName='mytable', bucket="mybucket")
rs_upsert_table(my_other_data, dbcon=con, tableName = 'mytable', bucket="mybucket", keys=c('id', 'date'))

rs_replace_table truncates the target table and then loads it entirely from the data frame, only do this if you don’t care about the current data it holds. On the other hand, rs_upsert_table replaces rows which have coinciding keys, and inserts those that do not exist in the table.
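
After either call, a quick sanity check from R is just a row count on the loaded table (using the placeholder table name from the example above):

dbGetQuery(con, "select count(*) from mytable")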
