Adding custom jars to pyspark in Jupyter notebook


Problem description

I am using a Jupyter notebook with Pyspark, running the following Docker image: Jupyter all-spark-notebook.

Now I would like to write a pyspark streaming application which consumes messages from Kafka. The Spark-Kafka Integration guide describes how to deploy such an application using spark-submit (it requires linking an external jar; the explanation is in 3. Deploying). But since I am using a Jupyter notebook, I never actually run the spark-submit command myself; I assume it gets run in the background when I press execute.

In the spark-submit command you can specify some parameters, one of which is --jars, but it is not clear to me how I can set this parameter from the notebook (or externally via environment variables). I am assuming I can link this external jar dynamically via the SparkConf or the SparkContext object. Does anyone have experience with how to perform the linking properly from the notebook?
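
For reference, a minimal sketch of the SparkConf route mentioned above (the jar path is only illustrative and must exist on the driver machine); note that "spark.jars" has to be set before the SparkContext is created:

from pyspark import SparkConf, SparkContext

# "spark.jars" takes a comma-separated list of jar paths to ship with the application;
# the path and app name below are illustrative.
conf = (SparkConf()
        .setAppName("kafka-notebook")
        .set("spark.jars", "/home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar"))

sc = SparkContext(conf=conf)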

Recommended answer

I've managed to get it working from within the Jupyter notebook which is running from the all-spark container.

I start a python3 notebook in jupyterhub and overwrite the PYSPARK_SUBMIT_ARGS flag as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory /home/jovyan:

import os

# Must be set before the SparkContext (and its JVM) is created;
# keep the trailing "pyspark-shell" or the context will not start.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
)

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)   # 1-second batch interval

broker = "<my_broker_ip>"
# Direct (receiver-less) stream reading from the "test1" topic
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                                                  {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

Note: Don't forget the pyspark-shell in the environment variable!

Extension: If you want to include code from spark-packages you can use the --packages flag instead. An example of how to do this in the all-spark-notebook can be found here.
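
As a sketch of what that could look like for the same Kafka dependency (the Maven coordinate below is an assumption; verify the group/artifact/version against Maven Central or spark-packages for your Spark build):

import os

# --packages resolves the dependency (and its transitive dependencies) from Maven
# instead of pointing at a locally downloaded jar.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.1 pyspark-shell'
)

import pyspark
sc = pyspark.SparkContext()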
