Jupyter + EMR + Spark - Connect to EMR cluster from Jupyter notebook on local machine
Question
I am new to PySpark and EMR.
I am trying to access Spark running on EMR cluster through Jupyter notebook, but running into errors.
I am generating a SparkSession using the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Carbon - SingleWell parallelization on Spark") \
    .getOrCreate()
I tried the following to access the remote cluster, but it errored out:
spark = SparkSession.builder \
    .master("spark://<remote-emr-ec2-hostname>:7077") \
    .appName("Carbon - SingleWell parallelization on Spark") \
    .getOrCreate()
Error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Any help resolving this would be much appreciated.
Answer
The `spark://<host>:7077` master URL assumes a Spark standalone master, but EMR runs Spark on YARN, so nothing is listening on port 7077 and the SparkContext fails to initialize.

Since EMR release 5.14.0, clusters can have Jupyter and JupyterHub provisioned for you as first-class applications. Most likely it is easier to tune those provisioned services with some extra bootstrap actions than to wire your local process up to talk to the EMR master node.
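As a sketch of that approach, the snippet below builds the request parameters you could pass to boto3's EMR `run_job_flow` call to launch a cluster with Spark and JupyterHub installed plus one extra bootstrap action. The cluster name, instance types, S3 script path, and role names are placeholders for illustration, not a verified production setup:

```python
# Hypothetical run_job_flow parameters: ask EMR to provision Spark and
# JupyterHub (available since release 5.14.0) and run one bootstrap action.
# Bucket path, names, and roles below are placeholders.
cluster_params = {
    "Name": "spark-jupyterhub-demo",        # placeholder cluster name
    "ReleaseLabel": "emr-5.14.0",           # first release shipping JupyterHub
    "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # keep cluster up for notebooks
    },
    "BootstrapActions": [
        {"Name": "install-extra-libs",       # hypothetical bootstrap script
         "ScriptBootstrapAction": {
             "Path": "s3://my-bucket/bootstrap/install_libs.sh"}},
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# To actually launch (needs AWS credentials configured):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   response = emr.run_job_flow(**cluster_params)
```

Once the cluster is up, JupyterHub runs on the master node, and you reach the notebooks through an SSH tunnel or the EMR console rather than by pointing a local SparkSession at the cluster.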