How to run PySpark jobs from a local Jupyter notebook to a Spark master in a Docker container?

Problem description

I have a Docker container that's running Apache Spark with a master and a slave worker. I'm attempting to submit a job from a Jupyter notebook on the host machine. See below:

# Init
!pip install findspark
import findspark
findspark.init()


# Context setup
from pyspark import SparkConf, SparkContext
# Docker container is exposing port 7077
conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)
sc

# Execute step: Monte Carlo estimate of pi
import random
num_samples = 1000
def inside(p):
  # Draw a random point in the unit square and test whether it falls inside the unit circle
  x, y = random.random(), random.random()
  return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

The execute step shows the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: 
    Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.17.0.2, executor 0): 

    java.io.IOException: Cannot run program "/Users/omar/anaconda3/bin/python": error=2, No such file or directory

It looks to me like the command is trying to run the Spark job locally, when it should be sending it to the Spark master specified in the previous steps. Is this not possible through a Jupyter notebook?

My container is based on https://hub.docker.com/r/p7hb/docker-spark/, but I installed Python 3.6 under /usr/bin/python3.6.

Recommended answer

I had to do the following before I created the SparkContext:

import os
# Path on master/worker where Python is installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'
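
Putting it together, here is a minimal sketch of the corrected notebook setup (my own summary, not taken verbatim from the original answer), assuming the master is still exposed on spark://localhost:7077 and that Python 3.6 is installed at /usr/bin/python3.6 on the master and worker:

import os
from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext is created, so the executors launch
# this interpreter instead of the driver's local Anaconda path
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'

conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)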

Some research showed that I needed to add this to /usr/local/spark/conf/spark-env.sh:

export PYSPARK_PYTHON='/usr/bin/python3.6'

But this didn't work.
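
To double-check which interpreter the executors actually end up using, one option (a small verification sketch, not part of the original answer) is to have each task report its own sys.executable:

def python_path(_):
  # Imported inside the function so it runs on the executor, not the driver
  import sys
  return sys.executable

# Every entry should be /usr/bin/python3.6 once PYSPARK_PYTHON is set correctly
print(sc.parallelize(range(2), 2).map(python_path).collect())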
