Spark (pyspark) having difficulty calling statistics methods on worker node

Problem Description

I am hitting a library error when running pyspark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.

On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:

from pyspark.mllib.stat import Statistics

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key: Statistics.chiSqTest(value).pValue for key, value in keys_to_bucketed.iteritems()}

but if I do the same directly on the RDD, I hit issues:

keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()

which results in the following exception:

Traceback (most recent call last):
  File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
    jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'

I had an issue early on in my spark install not seeing numpy, with mac-osx having two python installs (one from brew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the python libs that ships with the spark install (my previous issue had been with numpy).
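
One way to double-check which interpreter the executors actually see is to run a trivial job that reports sys.executable from each task, so a driver/executor mismatch shows up immediately. This is a minimal diagnostic sketch, assuming a running SparkContext named sc:

import sys

# Each task reports the interpreter path it runs under; if executors use a
# different Python install than the driver, the paths will differ.
print(sc.parallelize(range(2), 2).map(lambda _: sys.executable).collect())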

  1. Install Details
    • Mac OS X Yosemite
    • Spark spark-1.4.0-bin-hadoop2.6
    • python is specified via spark-env.sh as
    • PYSPARK_PYTHON=/usr/bin/python
    • PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
    • alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
    • PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
    • declare -x PYSPARK_DRIVER_PYTHON="ipython"

Answer

As you've noticed in your comment, sc on the worker nodes is None. The SparkContext is only defined on the driver node, so anything that reaches it through sc._jvm, as callMLlibFunc does, fails with the AttributeError above when it runs inside a worker task.
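
Because Statistics.chiSqTest always goes through the JVM, it can only be called where sc is alive, i.e. on the driver. A minimal worker-side sketch, assuming SciPy is installed on the executors, is to compute the same goodness-of-fit test in pure Python with scipy.stats.chisquare (which, like Statistics.chiSqTest, defaults to a uniform expected distribution):

from scipy.stats import chisquare

# Runs entirely in Python on the executors; no JVM-backed SparkContext is needed.
# chisquare returns (statistic, p-value), so index [1] to get the p-value.
keys_to_chi = vectors.mapValues(lambda vector: chisquare(vector)[1])
keys_to_chi.collectAsMap()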
