How to specify driver class path when using pyspark within a jupyter notebook?

Problem Description

I want to query a PostgreSQL database with pyspark from within a Jupyter notebook. I have browsed many questions on StackOverflow, but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:

from pyspark.sql import SparkSession

# create (or reuse) a SparkSession with the default configuration
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

Running this from a notebook raises the following error:

Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...

The principal tips I have found were summed up in the link below, but unfortunately I can't get them to work in my notebook:

Pyspark connection to a Postgres database in an ipython notebook

Note: I am using Spark 2.3.1 and Python 3.6.3, and I am able to connect to the database from the pyspark shell if I specify the jar location:

pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar

Thanks to anyone who can help me on this one.

EDIT

The answers from How to load jar dependencies in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:

import os

# note: this must run before the first SparkSession/SparkContext is created,
# otherwise the already-running JVM will not pick up the new arguments
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'

There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
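
For reference, the form of this workaround that is usually quoted appends pyspark-shell to the argument list, because PySpark hands the whole variable to spark-submit when the JVM is launched from a plain Python process. A minimal sketch, assuming a fresh kernel in which no Spark JVM has started yet:

import os
from pyspark.sql import SparkSession

# assumption: no SparkSession/SparkContext exists yet in this kernel;
# the trailing 'pyspark-shell' tells spark-submit what to launch
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path /path/to/postgresql.jar '
    '--jars /path/to/postgresql.jar '
    'pyspark-shell'
)

spark = SparkSession.builder.getOrCreate()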

Recommended Answer

Using the config method worked for me:

from pyspark.sql import SparkSession

# set the driver class path on the builder, before the driver JVM is launched
spark = (SparkSession.builder
         .config('spark.driver.extraClassPath', '/path/to/postgresql.jar')
         .getOrCreate())
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
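
The same read can also be expressed through the generic DataFrameReader API. The sketch below is an equivalent formulation, not part of the original answer; the option names are standard Spark JDBC options, org.postgresql.Driver is the class name of the PostgreSQL JDBC driver, and the jar path is illustrative:

from pyspark.sql import SparkSession

# ship the jar with spark.jars in addition to the driver class path,
# so executors can load the driver as well (path is illustrative)
spark = (SparkSession.builder
         .config('spark.driver.extraClassPath', '/path/to/postgresql.jar')
         .config('spark.jars', '/path/to/postgresql.jar')
         .getOrCreate())

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:postgresql://host/dbname')
      .option('dbtable', 'tablename')
      .option('user', 'username')
      .option('password', 'pwd')
      .option('driver', 'org.postgresql.Driver')  # explicit driver class
      .load())

Passing the driver option explicitly is often what resolves the java.sql.SQLException: No suitable driver error, since it spares DriverManager from having to discover the driver class from the URL alone.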
