Add Jar to standalone pyspark

Problem description

I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the Python code:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

How do I add jar dependencies such as the Databricks csv jar? Using the command line, I can add the package like this:

$ pyspark/spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 

But I'm not using any of these. The program is part of a larger workflow that is not using spark-submit. I should be able to run my ./foo.py program and it should just work.

  • I know you can set the spark property extraClassPath, but do the JAR files have to be copied to every node?
  • Tried conf.set("spark.jars", "jar1,jar2"); that didn't work either, failing with a py4j CNF exception (see the sketch below).
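For context, a minimal sketch of what the conf.set attempt above might look like; this is a reconstruction rather than code from the original question, and the jar paths are placeholders:

from pyspark import SparkContext, SparkConf

# Reconstruction of the attempt described above; the jar paths are placeholders.
# spark.jars takes comma-separated paths/URLs to jar files that already exist on
# an accessible filesystem; it does not resolve Maven coordinates (that is what
# spark.jars.packages / --packages is for).
conf = (SparkConf()
        .setAppName("Example")
        .setMaster("local[2]")
        .set("spark.jars", "/path/to/jar1.jar,/path/to/jar2.jar"))
sc = SparkContext(conf=conf)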

Recommended answer

There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.); some of the other answers already cover these. I wanted to add an additional answer for those specifically using Jupyter Notebooks and creating the Spark session from within the notebook. Here's the solution that worked best for me (in my case I wanted the Kafka package loaded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()

Using this line of code I didn't need to do anything else (no ENVs or conf file changes).
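As a hedged usage sketch (not part of the original answer), continuing from the spark session created above: once the Kafka package has been resolved via spark.jars.packages, the 'kafka' source can be used for a streaming read. The broker address and topic name are placeholders:

# Usage sketch: streaming read against the Kafka source provided by the package
# loaded above. 'localhost:9092' and 'my_topic' are placeholder values.
df = (spark.readStream
      .format('kafka')
      .option('kafka.bootstrap.servers', 'localhost:9092')
      .option('subscribe', 'my_topic')
      .load())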

2019-10-30 Update: The above line of code is still working great but I wanted to note a couple of things for new people seeing this answer:

  • You'll need to change the version at the end to match your Spark version, so for Spark 2.4.4 you'll need: org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
  • The newest version of this jar, spark-sql-kafka-0-10_2.12, is crashing for me (Mac laptop), so if you get a crash when invoking 'readStream', revert to 2.11.
