如何使用 JDBC 将 (Py)Spark 连接到 Postgres 数据库 [英] How to connect (Py)Spark to Postgres database using JDBC

查看:40
本文介绍了如何使用 JDBC 将 (Py)Spark 连接到 Postgres 数据库的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已按照 的说明进行操作此发布从现有的 Postgres 数据库中读取数据,该数据库具有名为objects"的表,该表由 SQLalchemy 中的 Objects 类定义和创建.在我的 Jupyter 笔记本中,我的代码是

I have followed instructions from this posting to read data from an existing Postgres database with table named "objects" as defined and created by the Objects class in SQLalchemy. In my Jupyter notebook, my code is

from pyspark import SparkContext
from pyspark import SparkConf
from random import random

#spark conf
conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('pyspark')

sc = SparkContext(conf=conf)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
properties = {
    "driver": "org.postgresql.Driver"
}
url = 'jdbc:postgresql://PG_USER:PASSWORD@PG_SERVER_IP/db_name'
df = sqlContext.read.jdbc(url=url, table='objects', properties=properties)

最后一行结果如下:

Py4JJavaError: An error occurred while calling o25.jdbc.
: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:158)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)

所以看起来它无法解析表格.我如何从这里测试以确保我正确连接到数据库?

so it looks like it can't resolve the table. How do I test from here to make sure that I am connected to the database properly?

推荐答案

名称解析问题由 org.postgresql.util.PSQLException 指示,不会导致 NPE.问题的根源实际上是连接字符串,尤其是您提供用户凭据的方式.乍一看,它看起来像一个错误,但如果您正在寻找快速解决方案,您可以使用 URL 属性:

Problems with name resolving are indicated by org.postgresql.util.PSQLException and don't result in NPE. The source of the issue is actually a connection string and in particular the way you provide user credentials. At first glance it looks like a bug but if you're looking for a quick solution you can either use URL properties:

url = 'jdbc:postgresql://PG_SERVER_IP/db_name?user=PG_USER&password=PASSWORD'

或属性参数:

properties = {
    "user": "PG_USER",
    "password": "PASSWORD",
    "driver": "org.postgresql.Driver"
}

这篇关于如何使用 JDBC 将 (Py)Spark 连接到 Postgres 数据库的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆