Shipping Python modules in pyspark to other nodes
Question
How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But when a module is not a plain '.py' file, how do I ship it?
Answer
If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile).
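As a minimal sketch of the packaging step (the module name `mymodule` is hypothetical), a pure-Python package can be zipped with the standard library and then passed to Spark via pyFiles or sc.addPyFile. Note that this plain-zip route only covers pure-Python code; a C extension like python-Levenshtein generally needs the .egg route or a cluster-wide install as described below:

```python
import os
import zipfile

# Create a tiny example package to ship (hypothetical name: mymodule)
os.makedirs("mymodule", exist_ok=True)
with open("mymodule/__init__.py", "w") as f:
    f.write("def greet():\n    return 'hello from mymodule'\n")

# Bundle it into a .zip archive that Spark can distribute to worker nodes
with zipfile.ZipFile("mymodule.zip", "w") as zf:
    zf.write("mymodule/__init__.py")

# Ship it when constructing the SparkContext ...
# sc = SparkContext("local", "App Name", pyFiles=["mymodule.zip"])
# ... or add it to an existing context later:
# sc.addPyFile("mymodule.zip")
```

After addPyFile, workers can `import mymodule` inside mapped functions, because Spark places the archive on each executor's Python path.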
For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).