How does Spark interoperate with CPython
Question
I have an Akka system written in Scala that needs to call out to some Python code, relying on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.
Answer
The PySpark architecture is described at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.
As @Holden said, Spark uses Py4J to access Java objects in the JVM from Python. But this is only one case: when the driver program is written in Python (the left part of the diagram there).
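For that Python-driver case, the JVM side just runs a Py4J `GatewayServer` that a Python process connects to over a local socket. A minimal Scala sketch, assuming the py4j artifact is on the classpath (the `EntryPoint` class and its method are hypothetical, not part of Spark):

```scala
import py4j.GatewayServer

// Hypothetical object exposing some JVM functionality to Python.
class EntryPoint {
  def add(a: Int, b: Int): Int = a + b
}

object GatewayApp {
  def main(args: Array[String]): Unit = {
    // Starts a server thread; Py4J listens on port 25333 by default.
    new GatewayServer(new EntryPoint).start()
    // Python side (for illustration):
    //   from py4j.java_gateway import JavaGateway
    //   gateway = JavaGateway()
    //   gateway.entry_point.add(1, 2)  # -> 3
  }
}
```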
The other case (the right part of the diagram) is when a Spark worker starts a Python process, sends serialized Java objects to the Python program to be processed, and receives the output. The Java objects are serialized into pickle format so that Python can read them.
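That serialization step is easy to reproduce outside Spark. Here is a minimal sketch using Pyrolite directly (the same library Spark uses, though this is not Spark's code), assuming the `net.razorvine:pyrolite` artifact is on the classpath:

```scala
import net.razorvine.pickle.{Pickler, Unpickler}

object PickleRoundTrip {
  def main(args: Array[String]): Unit = {
    // java.util collections and boxed primitives map to natural
    // Python types (dict, list, int, str) in the pickle stream.
    val payload = new java.util.HashMap[String, Any]()
    payload.put("xs", java.util.Arrays.asList(1, 2, 3))

    val bytes: Array[Byte] = new Pickler().dumps(payload)
    // A CPython process could now pickle.loads() these bytes;
    // Unpickler does the reverse on the JVM side.
    println(new Unpickler().loads(bytes)) // {xs=[1, 2, 3]}
  }
}
```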
It looks like the latter case is what you are looking for. Here are some links into Spark's Scala core that could be useful to get started:
- The Pyrolite library, which provides a Java interface to Python's pickle protocol - used by Spark to serialize Java objects into pickle format. For example, such a conversion is required for accessing the Key part of the Key/Value pairs of a PairRDD.
- The Scala code that starts the Python process and interacts with it: api/python/PythonRDD.scala (a rough sketch of this pattern follows after this list).
- The SerDe utilities that do the pickling: api/python/SerDeUtil.scala
- The Python side: python/pyspark/worker.py
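Putting the pieces together, here is a rough sketch of the worker pattern referenced in PythonRDD.scala above: pickle the input with Pyrolite, pipe it to a spawned Python process over stdin with a length prefix, and unpickle the result from stdout. The framing and the inline script are my own simplifications, not Spark's actual worker protocol (which also handles batching, error reporting, and worker reuse):

```scala
import java.io.{DataInputStream, DataOutputStream}
import net.razorvine.pickle.{Pickler, Unpickler}

object PythonWorkerSketch {
  // Inline worker: read a 4-byte big-endian length plus a pickle from
  // stdin, double each number, and write the result back the same way.
  val script: String =
    """import sys, struct, pickle
      |n = struct.unpack(">i", sys.stdin.buffer.read(4))[0]
      |data = pickle.loads(sys.stdin.buffer.read(n))
      |out = pickle.dumps([x * 2 for x in data], protocol=2)
      |sys.stdout.buffer.write(struct.pack(">i", len(out)) + out)
      |sys.stdout.buffer.flush()
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder("python3", "-c", script).start()
    val toPy = new DataOutputStream(proc.getOutputStream)
    val fromPy = new DataInputStream(proc.getInputStream)

    // Send a length-prefixed pickle of a Java list.
    val bytes = new Pickler().dumps(java.util.Arrays.asList(1, 2, 3))
    toPy.writeInt(bytes.length) // writeInt is big-endian, matching ">i"
    toPy.write(bytes)
    toPy.flush()

    // Read back the length-prefixed pickled result.
    val buf = new Array[Byte](fromPy.readInt())
    fromPy.readFully(buf)
    println(new Unpickler().loads(buf)) // [2, 4, 6]
    proc.waitFor()
  }
}
```

In a long-running Akka system you would presumably keep the Python process alive and stream batches over the pipe instead of spawning one per call, which is essentially what Spark's worker reuse does.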