Is it possible to call a Python function from Scala (Spark)?
Question
I am creating a Spark job that requires a column to be added to a DataFrame using a function written in Python. The rest of the processing is done in Scala.
I have found examples of how to call a Java/Scala function from PySpark:
- https://community.hortonworks.com/questions/110844/is-it-possible-to-call-a-scala-function-in-pythonp.html
- http://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html
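For reference, the Scala-to-PySpark direction covered by the links above works by exposing a Scala object on the JVM and reaching it from Python through the Py4J gateway (`spark._jvm`). A minimal sketch of the Scala side; the object name, method name, and column are my own illustrative choices, not from either article:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper; package it into a jar supplied with --jars so the
// JVM backing the PySpark session can load it.
object ScalaHelpers {
  // Adds a literal column to a DataFrame handed over from Python.
  def addGreeting(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit("hello"))
}
```

From PySpark you would then call something like `spark._jvm.ScalaHelpers.addGreeting(df._jdf)` and wrap the returned Java DataFrame back into a Python `DataFrame`. This is the reverse of what the question asks for.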
The only examples I have found of sending data the other way use `pipe`.
Is it possible for me to send the entire DataFrame to a Python function, have that function manipulate the data and add additional columns, and then send the resulting DataFrame back to the calling Scala code?
If this isn't possible, my current fallback is to run a PySpark driver and call multiple Scala functions from it to manipulate the DataFrame, which is not ideal.
Accepted answer
Just register a UDF from Python, then from Scala evaluate a SQL statement that uses that function against the DataFrame. It works like a charm; I just tried it ;) https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook is a good way to run a notebook in Toree that mixes Scala and Python code against the same Spark context.
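Concretely, the Python side of the shared session registers the function, e.g. `spark.udf.register("py_upper", lambda s: s.upper(), "string")`, and the Scala side then invokes it through SQL. A sketch of the Scala side, assuming a UDF name, view name, and input path of my own choosing:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// The DataFrame we want the Python function applied to
// (illustrative source path).
val df = spark.read.parquet("/data/input")
df.createOrReplaceTempView("input_table")

// "py_upper" was registered from Python against the same SparkSession,
// so Spark's SQL layer resolves it here even though the Scala code
// never sees the Python implementation.
val withExtra = spark.sql(
  "SELECT *, py_upper(name) AS name_upper FROM input_table")
```

The key constraint is that both languages must share the same `SparkSession` (for example in a Toree notebook, as suggested above), since the UDF registration lives in that session's function catalog.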