How to specify driver class path when using pyspark within a jupyter notebook?
Question
I want to query a PostgreSQL database with pyspark within a jupyter notebook. I have browsed a lot of questions on StackOverflow but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Running this from a notebook raises the following error:
Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...
The principal tips I have found are summed up in the link below, but unfortunately I can't get them to work in my notebook:
Pyspark connection to a Postgres database in an ipython notebook
Note: I am using Spark 2.3.1 and Python 3.6.3, and I am able to connect to the database from the pyspark shell if I specify the jar location:
pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar
Thanks to anyone who can help me on this one.
EDIT
The answers from How to load jar dependencies in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
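A common pitfall with this approach (not stated in the question, but a plausible cause of the failure) is that PYSPARK_SUBMIT_ARGS is only read when the JVM is launched, so it must be set before pyspark is first imported, and spark-submit expects the value to end with the token pyspark-shell. A minimal sketch, where /path/to/postgresql.jar stands in for the real jar location:

```python
import os

# Set PYSPARK_SUBMIT_ARGS *before* the first `import pyspark`; once the
# JVM is up, changing this variable has no effect on the running session.
# The trailing 'pyspark-shell' token is required by spark-submit.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path /path/to/postgresql.jar '
    '--jars /path/to/postgresql.jar '
    'pyspark-shell'
)

# import pyspark  # must happen only after the variable is set
```

If the notebook kernel had already created a SparkSession, it needs a restart for the new arguments to take effect.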
There's nothing wrong with the file path or the file itself, since it works fine when I specify it and run the pyspark shell.
Answer
Using the config method worked for me:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
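An equivalent way to express the same read, once the session has the driver jar on its class path, is the generic DataFrameReader options API; naming the driver class explicitly is a common workaround for the "No suitable driver" error when DriverManager fails to infer it from the URL. A sketch (the host, table, and credentials are placeholders from the question):

```python
# Options for spark.read.format('jdbc'); the explicit 'driver' entry
# tells Spark which JDBC driver class to load, which sidesteps
# java.sql.SQLException: No suitable driver.
jdbc_options = {
    'url': 'jdbc:postgresql://host/dbname',
    'dbtable': 'tablename',
    'user': 'username',
    'password': 'pwd',
    'driver': 'org.postgresql.Driver',
}

# With a live SparkSession this would be:
# df = spark.read.format('jdbc').options(**jdbc_options).load()
```

The driver class name org.postgresql.Driver is the one shipped in the PostgreSQL JDBC jar referenced earlier in the question.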