Can't instantiate Spark Context in iPython

Problem description

I'm trying to set up a standalone instance of Spark locally on a Mac and use the Python 3 API. To do this I've done the following:

1. I've downloaded and installed Scala and Spark.
2. I've set up the following environment variables:

#Scala
export SCALA_HOME=$HOME/scala/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin

#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

#Jupyter Python
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

#Python
alias python="python3"
alias pip="pip3"

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

Now, when I run the command

pyspark --master local[2]

and type sc in the notebook, I get the following:

SparkContext

Spark UI

Version: v2.2.1
Master: local[2]
AppName: PySparkShell

Clearly my SparkContext is not initialized. I'm expecting to see an initialized SparkContext object. What am I doing wrong here?

Recommended answer

Well, as I have argued elsewhere, setting PYSPARK_DRIVER_PYTHON to jupyter (or ipython) is a really bad and plain wrong practice, which can lead to unforeseen outcomes downstream, such as when you try to use spark-submit with the above settings...
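
For instance (my_job.py below is just a placeholder name for a batch script), with the notebook-related variables from the question still exported, spark-submit will hand the driver over to the notebook front-end instead of running the script as a plain batch job, so they have to be cleared first:

# my_job.py is a hypothetical batch script used only for illustration.
# With PYSPARK_DRIVER_PYTHON set to jupyter/ipython3 and PYSPARK_DRIVER_PYTHON_OPTS="notebook",
# spark-submit would try to use the notebook front-end as the driver Python instead of a
# plain interpreter, so unset them (or never export them globally) before submitting:
unset PYSPARK_DRIVER_PYTHON PYSPARK_DRIVER_PYTHON_OPTS
spark-submit --master local[2] my_job.py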

There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.

The first thing to do is run the jupyter kernelspec list command to get the list of any kernels already available on your machine; here is the result in my case (Ubuntu):

$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow

The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.

The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:

{
 "display_name": "PySpark (Spark 2.0)",
 "language": "python",
 "argv": [
  "/opt/intel/intelpython27/bin/python2",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
  "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
 }
}

Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run the jupyter kernelspec list command again); a sketch of this follows the quoted passage below. And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):


However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
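
Here is a minimal sketch of that manual approach, adapted to the Spark 2.2.1 paths and the python3 interpreter from the question (the pyspark221 folder name and the display name are arbitrary choices, and jupyter --data-dir is just one way to locate the kernels directory, which on macOS defaults to ~/Library/Jupyter):

# create a new kernel folder next to any existing ones and write the adapted spec into it
KERNEL_DIR="$(jupyter --data-dir)/kernels/pyspark221"
mkdir -p "$KERNEL_DIR"
cat > "$KERNEL_DIR/kernel.json" <<EOF
{
 "display_name": "PySpark (Spark 2.2.1)",
 "language": "python",
 "argv": ["python3", "-m", "ipykernel", "-f", "{connection_file}"],
 "env": {
  "SPARK_HOME": "$HOME/spark/spark-2.2.1-bin-hadoop2.7",
  "PYTHONPATH": "$HOME/spark/spark-2.2.1-bin-hadoop2.7/python:$HOME/spark/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
  "PYTHONSTARTUP": "$HOME/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "python3"
 }
}
EOF
jupyter kernelspec list   # the new kernel should now show up in this list

If python3 is not on the notebook server's PATH, put its full path in both argv and PYSPARK_PYTHON.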

If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.
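
If you go that route, a hedged sketch would look like this (assuming pyspark221 is a local folder already containing a kernel.json such as the one sketched above):

# copy the kernel folder into the per-user kernels location and verify it is picked up
jupyter kernelspec install ./pyspark221 --user --name pyspark221
jupyter kernelspec list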

If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:

"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

Finally, don't forget to remove all the PySpark/Jupyter-related environment variables from your bash profile (leaving only SPARK_HOME and PYSPARK_PYTHON should be OK).
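
A minimal sketch of what could remain in your ~/.bash_profile after that cleanup, keeping only the paths already given in the question (the notebook-related variables now live in the kernel spec instead):

#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

#Python interpreter used by PySpark
export PYSPARK_PYTHON=python3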

Another possibility could be to use Apache Toree, but I haven't tried it myself yet.
