Spark: How to map Python with Scala or Java User Defined Functions?


Problem description

Let's say, for instance, that my team has chosen Python as the reference language to develop with Spark. But later, for performance reasons, we would like to develop specific Scala or Java libraries in order to map them to our Python code (something similar to Python stubs with Scala or Java skeletons).

Do you think it is possible to interface new customized Python methods with some Scala or Java User Defined Functions under the hood?

Answer

I wouldn't go so far as to say it is supported, but it is certainly possible. All SQL functions currently available in PySpark are simply wrappers around the Scala API.
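
To make that concrete, here is a simplified sketch of the wrapping pattern PySpark itself uses in pyspark/sql/functions.py (condensed from the 1.x source; details vary between versions): each Python function simply looks up the matching method on org.apache.spark.sql.functions through the Py4J gateway and wraps the resulting Java column.

from pyspark import SparkContext
from pyspark.sql.column import Column

def _create_function(name, doc=""):
    """Create a Python wrapper for a JVM function, by name."""
    def _(col):
        sc = SparkContext._active_spark_context
        # Resolve the Scala function on org.apache.spark.sql.functions
        # and call it with the underlying Java column object.
        jc = getattr(sc._jvm.functions, name)(
            col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

upper = _create_function("upper", "Converts a string expression to upper case.")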

Let's assume I want to reuse the GroupConcat UDAF I created as an answer to SPARK SQL replacement for mysql GROUP_CONCAT aggregate function, and that it is located in a package com.example.udaf:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq
from pyspark.sql import Row

row = Row("k", "v")
df = sc.parallelize([
    row(1, "foo1"), row(1, "foo2"), row(2, "bar1"), row(2, "bar2")]).toDF()

def groupConcat(col):
    """Group and concatenate values for a given column

    >>> df = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ("k", "v"))
    >>> df.select(groupConcat("v").alias("vs")).collect()
    [Row(vs=u'foo,bar')]
    """
    sc = SparkContext._active_spark_context
    # It is possible to use java_import to avoid full package path
    _groupConcat = sc._jvm.com.example.udaf.GroupConcat.apply
    # Converting to Seq to match apply(exprs: Column*)
    return Column(_groupConcat(_to_seq(sc, [col], _to_java_column)))

df.groupBy("k").agg(groupConcat("v").alias("vs")).show()

## +---+---------+
## |  k|       vs|
## +---+---------+
## |  1|foo1,foo2|
## |  2|bar1,bar2|
## +---+---------+
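
As the comment in the function notes, java_import can be used to avoid spelling out the full package path on every lookup. A minimal sketch, assuming the same com.example.udaf.GroupConcat class is on the driver classpath:

from py4j.java_gateway import java_import

# Import the class into the driver's JVM view once...
java_import(sc._jvm, "com.example.udaf.GroupConcat")

# ...then reference it by its simple name.
_groupConcat = sc._jvm.GroupConcat.apply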

There are far too many leading underscores for my taste, but as you can see, it can be done.

