How do I install pyspark for use in standalone scripts?

Question

I'm trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quickstart examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.

I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.

What is the best way to add pyspark support for standalone scripts? I don't see a setup.py anywhere under the Spark install directory. How would I create a pip package for a Python script that depended on Spark?

Answer

You can set PYTHONPATH manually as you suggest, and this may be useful when testing standalone, non-interactive scripts on a local installation.
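
For local testing, a minimal sketch of that manual approach is to extend sys.path at the top of the script before importing pyspark; the SPARK_HOME default and the py4j zip pattern below are assumptions, so adjust them to your own installation:

    import glob
    import os
    import sys

    # Assumed install location; point this at your own SPARK_DIR.
    SPARK_HOME = os.environ.get("SPARK_HOME", "/opt/spark-1.0.2-bin-hadoop2")

    # Make the bundled pyspark package and its py4j dependency importable.
    sys.path.append(os.path.join(SPARK_HOME, "python"))
    sys.path.extend(glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*.zip")))

    import pyspark  # now resolves against the Spark distribution
    print(pyspark.__file__)

This avoids touching the shell environment, at the cost of hard-coding knowledge of the Spark layout into every script.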

However, (Py)Spark is all about distributing your jobs to nodes in a cluster. Each cluster has a configuration defining a manager and many parameters; the details of setting this up are covered in the Spark cluster documentation, which also describes a simple local cluster (useful for testing functionality).
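
For that simple local case, the master can also be set directly in the script; a sketch under that assumption (the application name is arbitrary):

    from pyspark import SparkConf, SparkContext

    # local[*] runs a self-contained "cluster" in-process, using all cores;
    # handy for testing functionality before targeting a real cluster manager.
    conf = SparkConf().setMaster("local[*]").setAppName("standalone-test")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(100))
    print(rdd.map(lambda x: x * x).sum())

    sc.stop()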

In production, you will be submitting tasks to Spark via spark-submit, which distributes your code to the cluster nodes and sets up the environment for it to run in on those nodes. You do, however, need to make sure either that the Python installations on the nodes have all the required dependencies (the recommended way) or that the dependencies are shipped along with your code (I don't know how that works).
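
As a sketch of that workflow, a hypothetical job script (count_words.py is just an illustrative name) together with the kind of spark-submit invocation that would ship it to the cluster:

    # count_words.py -- submitted with something along the lines of:
    #   SPARK_DIR/bin/spark-submit --master local[*] count_words.py input.txt
    # spark-submit puts pyspark on the path for the driver and the executors,
    # so the script can import it directly.
    import sys

    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="count-words")
        counts = (
            sc.textFile(sys.argv[1])
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
        )
        for word, n in counts.take(10):
            print("%s\t%d" % (word, n))
        sc.stop()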
