Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

Problem Description

We have huge amounts of server data stored in S3 (soon to be in a Parquet format). The data needs some transformation, and so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering if instead of manipulating it with Spark, writing back out to S3, and then copying to Redshift if I can just skip a step and run a query to pull/transform the data and then copy it straight to Redshift?

Solution

Sure thing, totally possible.

Scala code to read parquet (taken from here)

import sqlContext.implicits._ // needed so the RDD of case classes can be converted to a DataFrame

val people: RDD[Person] = ... // an RDD of Person case class objects
people.toDF().write.parquet("people.parquet") // convert to a DataFrame and write it out as Parquet
val parquetFile = sqlContext.read.parquet("people.parquet") // read it back; the result is a DataFrame
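
In the question's setup the Parquet data lives in S3 rather than on local disk, so the same read call can point straight at an S3 path (the bucket path below is only a placeholder):

val serverData = sqlContext.read.parquet("s3n://my-bucket/server-data/") // DataFrame backed by the Parquet files in S3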

Scala code to write to redshift (taken from here)

parquetFile.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
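
Putting the pieces together for the "skip a step" part of the question: the transformation can be applied to the DataFrame in Spark and the result written straight to Redshift, with no intermediate data set written back to S3 by your own code. Note that spark-redshift still stages the rows under tempdir in S3 and loads them into Redshift with a COPY behind the scenes, so a temp bucket is required. A minimal sketch using the serverData DataFrame read from S3 above, where the filter, column names, and target table are placeholders for illustration:

// Example transformation only; replace with the real pull/transform logic
val transformed = serverData
  .filter(serverData("status") === "active")
  .select("id", "host", "ts")

// Write the transformed DataFrame straight to Redshift; the connector writes the
// rows to tempdir in S3 and then runs a COPY into the target table
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_transformed_table")
  .option("tempdir", "s3n://my-bucket/temp/")
  .mode("error")
  .save()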
