How to connect to Amazon Redshift or other DB's in Apache Spark?
Question
I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capability of connecting via JDBC:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here; maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
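On the credentials question: the database user and password can ride along as extra options to the JDBC source. A minimal sketch against the Spark 1.3-era API, where the host, database name, and credential values are all placeholders; note that AWS keys are a separate concern and only matter for the S3 side, not the JDBC connection:

```python
# Hypothetical connection details for the Spark 1.3-era JDBC source.
# The database user/password go in the load options; AWS credentials
# are only needed for reading S3, not for the JDBC connection itself.
jdbc_options = {
    "url": "jdbc:postgresql://dbserver:5439/mydb",  # 5439 is Redshift's default port
    "dbtable": "schema.tablename",
    "user": "my_user",
    "password": "my_password",
}
# df = sqlContext.load(source="jdbc", **jdbc_options)
```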
And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an IPython notebook (as part of the Spark distribution). Where do I define it so that Spark picks it up?
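For a notebook-launched PySpark, one way to get a JDBC driver jar onto the classpath is to set PYSPARK_SUBMIT_ARGS in the environment before starting the notebook. This is a sketch of one common setup, not the only option, and the jar path is a placeholder:

```shell
# Make a JDBC driver jar visible to PySpark launched from a notebook by
# setting PYSPARK_SUBMIT_ARGS before the notebook process starts.
# The jar path below is a placeholder.
export PYSPARK_SUBMIT_ARGS="--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar pyspark-shell"
# then launch the notebook-backed shell, e.g. (Spark 1.x style):
#   IPYTHON=1 bin/pyspark
```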
Anyway, when I try running these commands, I get a bunch of undecipherable errors, so I'm stuck for now. Any help or pointers to detailed tutorials are appreciated.
Answer
Although this seems to be a very old post, for anyone still looking for an answer, the steps below worked for me!
Start the shell with the JDBC driver jar included:
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a DataFrame by giving the appropriate connection details:
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
Spark version: 2.2
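One detail worth noting: the launch command above puts the PostgreSQL driver on the classpath, but the jdbc:redshift URL scheme expects Amazon's own Redshift JDBC driver. With only the PostgreSQL jar, a jdbc:postgresql URL also works against Redshift, which listens on port 5439 by default. A small sketch with a hypothetical helper and hypothetical host/database names:

```python
# Hypothetical helper: build a PostgreSQL-scheme JDBC URL for a
# Redshift endpoint (5439 is Redshift's default port), usable with
# the plain PostgreSQL driver loaded by the shell command above.
def redshift_jdbc_url(host, db_name, port=5439):
    return "jdbc:postgresql://%s:%d/%s" % (host, port, db_name)

url = redshift_jdbc_url("my-cluster.example.us-east-1.redshift.amazonaws.com", "db_name")
# myDF = spark.read.format("jdbc").option("url", url) ... .load()  as above
```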