How do I install pyspark for use in standalone scripts?

Question

I'm trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quickstart examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.

I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.

What is the best way to add pyspark support for standalone scripts? I don't see a setup.py anywhere under the Spark install directory. How would I create a pip package for a Python script that depended on Spark?

Answer

You can set the PYTHONPATH manually as you suggest, and this may be useful to you when testing stand-alone non-interactive scripts on a local installation.
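For local testing, here is a minimal sketch of that manual approach (the install path is a placeholder, and the Py4J zip bundled under python/lib is located with a glob because its exact name varies by Spark version):

```python
import glob
import os
import sys

# Hypothetical install location; adjust to wherever the Spark 1.0.2
# binary distribution was unpacked.
SPARK_DIR = os.environ.get("SPARK_HOME", "/opt/spark-1.0.2-bin-hadoop2")

# Make the bundled pyspark package importable.
sys.path.insert(0, os.path.join(SPARK_DIR, "python"))

# pyspark needs Py4J, which ships as a zip inside the distribution.
for py4j_zip in glob.glob(os.path.join(SPARK_DIR, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, py4j_zip)

from pyspark import SparkContext  # import only after the path is set up

sc = SparkContext("local", "standalone-test")
print(sc.parallelize(range(100)).filter(lambda x: x % 7 == 0).count())
sc.stop()
```

Equivalently, the same two paths can be exported on PYTHONPATH in the shell before launching the script, which keeps the script itself free of path manipulation.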

However, (py)spark is all about distributing your jobs to nodes on clusters. Each cluster has a configuration defining a manager and many parameters; the details of setting this up are covered in the Spark documentation, and include a simple local cluster (which may be useful for testing functionality).
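As an illustration of choosing the master in code (a sketch only; the cluster URL shown is a placeholder), SparkConf lets the same script run against a local master for testing and be pointed at a real manager later:

```python
from pyspark import SparkConf, SparkContext

# "local[*]" runs Spark in-process using all local cores -- handy for testing.
# For a real cluster, point this at the manager instead, e.g.
# "spark://master-host:7077" (placeholder URL).
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("config-example"))

sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).collect())
sc.stop()
```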

In production, you will be submitting jobs to Spark via spark-submit, which distributes your code to the cluster nodes and establishes the context for it to run in on those nodes. You do, however, need to make sure that the Python installations on the nodes have all the required dependencies (the recommended way), or that the dependencies are passed along with your code (I don't know how that works).
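One way pure-Python dependencies can travel with the code is spark-submit's --py-files option, which accepts .py, .zip, or .egg files and places them on the workers' PYTHONPATH. A sketch, with hypothetical script, cluster, and file names:

```python
# my_job.py -- submitted with something like (names are placeholders):
#
#   spark-submit --master spark://master-host:7077 \
#                --py-files deps.zip \
#                my_job.py
#
# --py-files ships the listed files to the executors; compiled or native
# dependencies still have to be installed on each node.
from pyspark import SparkContext

# spark-submit supplies the master, so only the app name is set here.
sc = SparkContext(appName="submit-example")

counts = (sc.textFile("hdfs:///data/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()
```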
