Using Postgresql JDBC source with Apache Spark on EMR


Problem description

I have an existing EMR cluster running and wish to create a DF from a Postgresql DB source.

To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR already downloaded on the master and slave nodes, or you can pass these as arguments to a spark-submit job.

Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
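
One route worth noting that avoids touching spark-defaults.conf at all, assuming the notebook talks to Spark through Livy/Sparkmagic (as EMR notebooks typically do), is the `%%configure` magic, which restarts just the Spark session with extra configuration. A sketch, using the standard Maven coordinates for the PostgreSQL JDBC driver:

```
%%configure -f
{"conf": {"spark.jars.packages": "org.postgresql:postgresql:42.2.5"}}
```

This only restarts the notebook's Spark session, not the cluster, and Spark resolves and distributes the jar itself.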

I tried the following:

  1. Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL jar to it (postgresql-9.41207.jre6.jar).

Edited spark-defaults.conf to include the wildcard location:

spark.driver.extraClassPath  :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$

  • Tried to create a dataframe in a Jupyter cell using the following code:

    SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
    spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
    

  • I get a Java error as per below:

    Py4JJavaError: An error occurred while calling o396.jdbc.
    : java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
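
Aside from the classpath question, the driver class named in the snippet above also looks wrong: the PostgreSQL JDBC jar ships `org.postgresql.Driver`, and `com.postgresql.jdbc.Driver` does not exist, so the call would raise ClassNotFoundException even with the jar present. A minimal corrected sketch (connection details are the question's placeholders; the read call is commented out because it needs a live Spark session):

```python
# The PostgreSQL JDBC driver class is org.postgresql.Driver;
# com.postgresql.jdbc.Driver does not exist in the jar.
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
properties = {"driver": "org.postgresql.Driver"}

# With a live SparkSession and the jar on the classpath:
# df = spark.read.jdbc(SQL_CONN, table="someTable", properties=properties)
```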
    

Help appreciated.

Recommended answer

I think you don't need to copy the postgres jar to the slaves, as the driver program and cluster manager take care of everything. I've created a dataframe from a Postgres external source in the following way:

Download the postgres driver jar:

    cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
    

Create the dataframe:

    attribute = {'url': 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                     .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
                 'database': <db>,
                 'dbtable': <select * from table>}
    df = spark.read.format('jdbc').options(**attribute).load()
    

Submit the Spark job: add the downloaded jar to the driver class path when submitting the Spark job.

    --properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5 
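
With plain spark-submit (rather than the `--properties` form above), the equivalent flags would be, as a sketch with a placeholder script name:

```shell
# Sketch: my_job.py is a placeholder; the jar path is from the download step.
# --jars ships the jar to the executors; --driver-class-path adds it to the
# driver's classpath.
spark-submit \
  --jars "$HOME/postgresql-42.2.5.jar" \
  --driver-class-path "$HOME/postgresql-42.2.5.jar" \
  my_job.py
```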
    
