Libraries needed to use Spark from Python (PySpark)


Question

I am using PySpark from Django and connecting to a Spark master node with a SparkSession to execute jobs on the cluster.
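For reference, the connection I am describing looks roughly like the sketch below; the master URL and application name are placeholders, not my real configuration:

from pyspark.sql import SparkSession

# Connect to an existing standalone master (placeholder address and app name).
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("django-spark-job")
    .getOrCreate()
)

# Run a trivial job on the cluster to confirm the session works.
print(spark.range(100).count())

spark.stop()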

My question is: do I need a full install of Spark on my local machine? All the documentation has me install Spark and then add the PySpark libraries to the Python path. I don't believe I need all ~500 MB of that just to connect to an existing cluster. I'm trying to lighten my Docker containers.

Thanks for your help.

Answer

Although I have not tested it, as of Spark 2.1, PySpark is available from PyPI (for installation via pip), precisely for cases such as yours. From the docs:

The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to set up your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.

NOTE: If you are using this with a Spark standalone cluster, you must ensure that the version (including minor version) matches, or you may experience odd errors.
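As a rough sketch of putting this together (the pinned version, master URL, and app name below are assumptions; substitute whatever your cluster actually runs):

# Install a client whose version matches the cluster exactly, for example:
#   pip install pyspark==2.1.3
# (2.1.3 is only an illustrative version number.)
import pyspark
from pyspark.sql import SparkSession

# The pip-installed package reports its own version; per the note above it
# should match the standalone cluster's version, including the minor version.
print("client version:", pyspark.__version__)

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # placeholder master URL
    .appName("pyspark-pip-check")
    .getOrCreate()
)

# The live session reports the version the driver actually runs.
print("session version:", spark.version)
spark.stop()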
