Jupyter pyspark : no module named pyspark


Problem description

Google is literally littered with solutions to this problem, but unfortunately even after trying out all the possibilities I am unable to get it working, so please bear with me and see if something strikes you.

OS: Mac
Spark: 1.6.3 (2.10)
Jupyter Notebook: 4.4.0
Python: 2.7
Scala: 2.12.1

I was able to successfully install and run Jupyter notebook. Next, I tried configuring it to work with Spark, for which I installed the Spark interpreter using Apache Toree. Now when I try running any RDD operation in the notebook, the following error is thrown:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar
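
A quick way to see where the mismatch is (the commands below are a sketch, not from the original post) is to check which Python the shell resolves and whether it can import pyspark, since the worker in the traceback is explicitly /usr/bin/python:

# which interpreter is first on PATH for the shell launching Spark?
which python
# can that interpreter import pyspark? prints its install location if so
python -c "import pyspark; print(pyspark.__file__)"
# the worker in the error uses /usr/bin/python; this reproduces the failure
# if that interpreter does not have the Spark Python libraries on its path
/usr/bin/python -c "import pyspark"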

Things already tried:

1. Set PYTHONPATH in .bash_profile
2. Am able to import 'pyspark' in python-cli on local
3. Have tried updating the interpreter kernel.json to the following:

{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
  "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}




4. Have even updated the interpreter run.sh to explicitly load the py4j-0.9-src.zip and pyspark.zip files. When the PySpark notebook is opened and a SparkContext is created, I can see the spark-assembly, py4j and pyspark packages being uploaded from local, but still, when an action is invoked, pyspark is not found.
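
One detail the kernel.json env block does not control is which interpreter the Spark workers run: the error above shows /usr/bin/python, which has no pyspark on its path. As a hedged sketch only (SPARK_HOME mirrors the path from the question; PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are standard Spark environment variables), exporting these in .bash_profile or in the kernelspec env would point both the driver and the workers at an interpreter that can see the Spark Python libraries:

export SPARK_HOME=/Users/xxxx/Desktop/utils/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
# make workers use the same interpreter as the driver instead of /usr/bin/python
export PYSPARK_PYTHON=$(which python)
export PYSPARK_DRIVER_PYTHON=$(which python)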


Recommended answer

We create a file startjupyter.sh in the path where we have Jupyter and keep all the environment settings in this file, say as stated below:

export SPARK_HOME=/home/gps/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Also give a path for the error and log files in it. You can also give the port number on which you want to run the notebook. Save the file and execute ./startjupyter.sh. Then check the Jupyter.err file; it will give the token to access the Jupyter notebook online through the URL.
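
For illustration, the remaining lines of such a script might look like the sketch below; the log paths and the port are placeholders, not values from the original answer. Launching pyspark with PYSPARK_DRIVER_PYTHON=jupyter starts the notebook server with the Spark Python libraries wired in, and redirecting its output puts the access token into Jupyter.err:

# optional: pin a port and skip opening a browser
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8888'
# pyspark launches Jupyter because of PYSPARK_DRIVER_PYTHON=jupyter;
# the token URL is written to Jupyter.err
pyspark > /path/to/logs/Jupyter.out 2> /path/to/logs/Jupyter.err &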
