How to connect to Amazon Redshift or other DBs in Apache Spark?

Question

I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capability of connecting to JDBC:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an iPython notebook (as part of the Spark distribution). Where do I define that so that Spark loads it?

Anyway, for now, when I try running these commands, I get a bunch of undecipherable errors, so I'm kind of stuck for now. Any help or pointers to detailed tutorials are appreciated.

Recommended answer

Although this is a very old post, for anyone still looking for an answer, the steps below worked for me!

Start the shell including the jar.

bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
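For the notebook case raised in the question, one option is to pass the same flags through the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. The following is a minimal sketch, assuming the notebook kernel starts Spark itself (rather than being started by bin/pyspark) and that the variable is set before any SparkSession or SparkContext exists. The helper name `pyspark_submit_args` is made up for illustration, and the jar path is the placeholder from the command above.

```python
import os


def pyspark_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that puts one JDBC jar on the classpath.

    Hypothetical helper: it mirrors the --driver-class-path and --jars flags
    of the shell command and appends the "pyspark-shell" token that PySpark
    expects at the end of the value.
    """
    return "--driver-class-path {0} --jars {0} pyspark-shell".format(jar_path)


# Placeholder path taken from the command above; point it at the real jar.
JAR_PATH = "/path_to_postgresql-42.1.4.jar"

# Must run before the first SparkSession/SparkContext is created,
# otherwise the JVM is already up and the setting has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(JAR_PATH)
```

If the notebook is instead launched through bin/pyspark, the flags can stay on the command line as shown and this step is unnecessary.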

Create a df by giving appropriate details:

# Wrap the chained calls in parentheses so the statement can span several lines.
myDF = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:redshift://host:port/db_name")
    .option("dbtable", "table_name")
    .option("user", "user_name")
    .option("password", "password")
    .load()
)

Spark version: 2.2
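On the credentials point from the question: the JDBC source takes the database user name and password as options, not AWS access keys. Below is a minimal sketch of collecting those options in one place and reading the secrets from environment variables instead of hard-coding them. The helper names `jdbc_options` and `load_table`, the host name, and the `REDSHIFT_USER` / `REDSHIFT_PASSWORD` variable names are all assumptions made for illustration. Port 5439 is Redshift's default. Note also that a `jdbc:redshift://` URL is normally served by Amazon's Redshift JDBC driver, while the PostgreSQL jar expects `jdbc:postgresql://`, so the `scheme` argument should match whichever jar was put on the classpath.

```python
import os


def jdbc_options(host, port, database, table, user, password, scheme="redshift"):
    """Return the option dict for Spark's JDBC data source (hypothetical helper)."""
    return {
        "url": "jdbc:{}://{}:{}/{}".format(scheme, host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
    }


def load_table(spark, options):
    """Load one table over JDBC; `spark` is an already running SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical host and environment variable names; the fallbacks are the
# placeholders used in the answer above.
opts = jdbc_options(
    host="example-cluster.example.com",
    port=5439,
    database="db_name",
    table="schema.tablename",
    user=os.environ.get("REDSHIFT_USER", "user_name"),
    password=os.environ.get("REDSHIFT_PASSWORD", "password"),
)

# In a running session: myDF = load_table(spark, opts)
```

Keeping the options in a plain dict makes it easy to switch `scheme` to "postgresql" when the PostgreSQL driver is the one loaded, without touching the read call.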
