Pyspark connection to Postgres database in ipython notebook
Problem description
I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.
I am able to launch pyspark in an ipython notebook, and SparkContext is loaded as 'sc'.
I have the following in my .bash_profile for finding the Postgres driver:
export SPARK_CLASSPATH=/path/to/downloaded/jar
Here's what I am doing in the ipython notebook to connect to the db (based on this post):
from pyspark.sql import SQLContext
from pyspark.sql import DataFrameReader as dfr

sqlContext = SQLContext(sc)
table = 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = dfr(sqlContext).jdbc(
    url='jdbc:%s' % url, table=table, properties=properties
)
The error:
Py4JJavaError: An error occurred while calling o156.jdbc.
: java.sql.SQLException: No suitable driver.
I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.
I also tried to set driver via pyspark --jars, but I get a "no such file or directory" error.
This blogpost also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:
./bin/spark-shell --packages org.postgresql:postgresql:42.1.4
Other information:
spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4
Answer
I followed the directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile and use the following in my ipython notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'
I added a 'driver' setting to the properties dict as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated and it is preferable to use --driver-class-path.
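Putting the pieces together, a minimal sketch of the working setup might look like the following. The jar path, database name, and credentials are placeholders; only the env-var construction and the connection settings are shown here, with the actual Spark read call indicated in a comment since it requires a running notebook with pyspark on the path.

```python
import os

# Must be set BEFORE any SparkContext is created in the notebook.
# The jar path is a placeholder; point it at your downloaded driver.
jar = '/path/to/postgresql-42.1.4.jar'
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path %s --jars %s pyspark-shell' % (jar, jar)
)

# JDBC connection settings; 'driver' names the driver class inside the jar,
# which is what made the "No suitable driver" error go away.
url = 'jdbc:postgresql://localhost:5432/dbname'
properties = {
    'user': 'username',
    'password': 'password',
    'driver': 'org.postgresql.Driver',
}

# In the notebook you would then read with, e.g.:
# df = sqlContext.read.jdbc(url=url, table='schema.tablename',
#                           properties=properties)
```

Note that PYSPARK_SUBMIT_ARGS is only consulted when the JVM is launched, so setting it after a SparkContext already exists has no effect; restart the kernel if sc was created first.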