Adding custom jars to pyspark in jupyter notebook

This article describes how to add custom jars to pyspark in a Jupyter notebook; the question and accepted answer below may be a useful reference for anyone facing the same problem.

Problem description

I am using a Jupyter notebook with Pyspark and the following Docker image: Jupyter all-spark-notebook.

Now I would like to write a pyspark streaming application which consumes messages from Kafka. The Spark-Kafka Integration guide describes how to deploy such an application using spark-submit (it requires linking an external jar; the explanation is in 3. Deploying). But since I am using a Jupyter notebook, I never actually run the spark-submit command myself; I assume it gets run in the background when I press execute.

In the spark-submit command you can specify some parameters, one of them being --jars, but it is not clear to me how I can set this parameter from the notebook (or externally via environment variables?). I am assuming I can link this external jar dynamically via the SparkConf or the SparkContext object. Does anyone have experience with how to perform the linking properly from the notebook?
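For illustration, what I have in mind is roughly the following sketch (spark.jars is the standard Spark property for a comma-separated list of jars; the path is a placeholder and I have not confirmed this alone is enough for the Kafka assembly):

import pyspark

# Sketch of the SparkConf idea from the question: point spark.jars at the
# Kafka assembly before the SparkContext is created (the path is a placeholder).
conf = pyspark.SparkConf()
conf.set("spark.jars", "/home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar")
sc = pyspark.SparkContext(conf=conf)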

Solution

I've managed to get it working from within the Jupyter notebook which is running from the all-spark container.

I start a python3 notebook in jupyterhub and override the PYSPARK_SUBMIT_ARGS environment variable as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory /home/jovyan:

import os

# The --jars entry must come before the trailing pyspark-shell; the assembly jar
# was placed in the notebook user's home directory beforehand.
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

# Create the SparkContext only after PYSPARK_SUBMIT_ARGS is set, then a
# streaming context with a 1-second batch interval.
sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)

broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                        {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

Note: Don't forget the trailing pyspark-shell in the environment variable!
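The assembly jar above was downloaded from the Maven repository beforehand. If you would rather fetch it from within the notebook, a minimal sketch using urllib follows; the URL is an assumption based on the standard Maven Central layout for these coordinates, so verify it before relying on it:

from urllib.request import urlretrieve

# Assumed Maven Central location for the Kafka assembly jar (standard
# groupId/artifactId/version layout); adjust if the coordinates differ.
jar_url = ("https://repo1.maven.org/maven2/org/apache/spark/"
           "spark-streaming-kafka-assembly_2.10/1.6.1/"
           "spark-streaming-kafka-assembly_2.10-1.6.1.jar")
jar_path = "/home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar"

# Put the jar where the --jars entry in PYSPARK_SUBMIT_ARGS expects it.
urlretrieve(jar_url, jar_path)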

Extension: If you want to include code from spark-packages you can use the --packages flag instead. An example of how to do this in the all-spark-notebook can be found here.
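A minimal sketch of the --packages variant, assuming the org.apache.spark:spark-streaming-kafka_2.10:1.6.1 coordinates for this Spark/Scala version; with --packages the dependency is resolved and downloaded automatically, so no local jar path is needed:

import os

# Assumed Maven coordinates for the Kafka streaming package; Spark resolves
# and downloads them at startup, so no local jar is required.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 pyspark-shell'
)

import pyspark
sc = pyspark.SparkContext()  # the package is now on the driver and executor classpath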

