How does Spark interoperate with CPython

Question

I have an Akka system written in Scala that needs to call out to some Python code that relies on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.

Answer

The PySpark architecture is described here: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

As @Holden said, Spark uses py4j to access Java objects in the JVM from Python. But this is only one case: when the driver program is written in Python (the left part of the diagram).
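For context, the JVM half of that first case is a py4j GatewayServer that the Python driver connects to. Below is a minimal sketch of that pattern in Scala, assuming the py4j jar is on the classpath; the ExampleEntryPoint class and its add method are made-up names for illustration:

```scala
import py4j.GatewayServer

// Hypothetical entry point exposing some JVM functionality to a Python driver.
class ExampleEntryPoint {
  def add(a: Int, b: Int): Int = a + b
}

object GatewayApp {
  def main(args: Array[String]): Unit = {
    // Listens on py4j's default port (25333). A Python driver can then
    // connect with py4j's JavaGateway and call gateway.entry_point.add(1, 2).
    val server = new GatewayServer(new ExampleEntryPoint)
    server.start()
  }
}
```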

The other case (the right part of the diagram) is when a Spark worker starts a Python process, sends it serialized Java objects to be processed, and receives the output back. The Java objects are serialized into pickle format so that Python can read them.
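Stripped of Spark's specifics, that worker pattern is a JVM process launching CPython and streaming records over stdin/stdout. The sketch below is a toy version under stated assumptions: it exchanges newline-delimited text instead of Spark's framed pickle protocol, and the inline Python one-liner stands in for a real worker script:

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

object PythonWorkerSketch {
  def main(args: Array[String]): Unit = {
    // Launch a CPython process that reads lines from stdin, transforms them,
    // and writes results back to stdout. Real Spark frames pickled bytes
    // instead of using newline-delimited text.
    val proc = new ProcessBuilder(
      "python3", "-c",
      "import sys\nfor line in sys.stdin: print(line.strip().upper(), flush=True)"
    ).start()

    val toPython   = new PrintWriter(proc.getOutputStream, true) // autoflush
    val fromPython = new BufferedReader(new InputStreamReader(proc.getInputStream))

    for (record <- Seq("pandas", "numpy")) {
      toPython.println(record)          // send one record to the worker
      println(fromPython.readLine())    // read the processed record back
    }

    toPython.close()                    // EOF ends the Python loop
    proc.waitFor()
  }
}
```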

It looks like what you are looking for is the latter case. Here are some links into Spark's Scala core that may be useful to get started:

  • The Pyrolite library, which provides a Java interface to Python's pickle protocol; Spark uses it to serialize Java objects into pickle format. Such a conversion is required, for example, to access the key part of the (key, value) pairs in a PairRDD. (A minimal usage sketch follows this list.)

  • The Scala code that starts the Python process and iterates with it: api/python/PythonRDD.scala

  • The SerDeser utils that do the pickling: api/python/SerDeUtil.scala

  • The Python side: python/pyspark/worker.py
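To give a feel for the Pyrolite piece, here is a minimal round-trip sketch, assuming the Pyrolite (net.razorvine.pickle) artifact is on the classpath. In the actual Spark flow the pickled bytes are written to the Python worker rather than unpickled back on the JVM:

```scala
import java.util.{ArrayList => JArrayList}
import net.razorvine.pickle.{Pickler, Unpickler}

object PyroliteSketch {
  def main(args: Array[String]): Unit = {
    // Pyrolite maps java.util collections, arrays and primitives onto
    // their natural Python equivalents.
    val data = new JArrayList[Int]()
    data.add(1); data.add(2); data.add(3)

    // Serialize the JVM object into pickle-format bytes CPython can read.
    val bytes: Array[Byte] = new Pickler().dumps(data)

    // Round-trip on the JVM side just to demonstrate the format.
    val restored = new Unpickler().loads(bytes)
    println(restored) // prints [1, 2, 3]
  }
}
```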
