Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?
Question
We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering if, instead of manipulating it with Spark, writing back out to S3, and then copying to Redshift, I can just skip a step and run a query to pull/transform the data and then copy it straight to Redshift.
Answer

Sure thing, totally possible.
Scala code to read Parquet (taken from here):
val people: RDD[Person] = ...
people.write.parquet("people.parquet")
val parquetFile = sqlContext.read.parquet("people.parquet") //data frame
Scala code to write to Redshift (taken from here):

parquetFile.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
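Since the server data is already in S3, the same connector lets you do the read, the transformation, and the load in a single job, which is the step-skipping the question asks about. Below is a minimal sketch under that assumption; the S3 paths, temp table name, SQL query, and target table are placeholders, not part of the original answer. Note that spark-redshift still stages the data under tempdir in S3 and loads it into Redshift with a COPY from there, so S3 is not bypassed entirely; you only skip the extra manual round trip.

// Read the Parquet data straight from S3 (placeholder path).
val serverData = sqlContext.read.parquet("s3n://my-bucket/server-data/")
serverData.registerTempTable("server_data")

// Apply whatever transformation is needed, expressed in Spark SQL (placeholder query).
val transformed = sqlContext.sql(
  "SELECT host, COUNT(*) AS request_count FROM server_data GROUP BY host")

// Write the transformed result to Redshift in the same job.
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_data_summary")      // placeholder target table
  .option("tempdir", "s3n://path/for/temp/data") // connector stages data here before COPY
  .mode("error")
  .save()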