Spark (2.3+) Java functions callable from PySpark/Python


Question

Re Spark Doc 2.3:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerJavaFunction

registerJavaFunction(name, javaClassName, returnType=None)[source]

Register a Java user-defined function as a SQL function.

In addition to a name and the function itself, the return type can be optionally specified. When the return type is not specified we would infer it via reflection.

Parameters:

name – name of the user-defined function

javaClassName – fully qualified name of the Java class

returnType – the return type of the registered Java function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.


My question:

I want to build a library containing a large number of UDFs for Spark 2.3+, all written in Java and all accessible from PySpark/Python.

Reading the documentation linked above, it appears that there is a one-to-one mapping between a class and a Java UDF function (callable from Spark SQL in PySpark). So if I have, say, 10 Java UDF functions, then I need to create 10 public Java classes, with one UDF per class, to make them callable from PySpark/SQL.

Is this correct?

Can I create one public Java class, place a number of different UDFs inside that one class, and make all of them callable from PySpark in Spark 2.3?

This post does not provide any Java sample code to help with my question; it looks like it is all in Scala. I want it all in Java, please. Do I need to extend a class or implement an interface to do this in Java? Any links to sample Java code callable from PySpark-SQL would be appreciated.

Spark: How to map Python with Scala or Java User Defined Functions?

Answer

So if I have, say, 10 Java UDF functions, then I need to create 10 public Java classes, with one UDF per class, to make them callable from PySpark/SQL.

Is this correct?

Yes, that's correct. However, you can:

  • Use UserDefinedFunction and interface it as shown in Spark: How to map Python with Scala or Java User Defined Functions?
  • Use UDFRegistration.register to register named UDFs, and then call org.apache.spark.sql.functions.callUDF through Py4j for each registered function (see the sketch after this list).
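
To illustrate the second option, here is a minimal, hypothetical sketch (the class, package, and UDF names are made up): one public Java class can register any number of UDF lambdas by name through UDFRegistration.register, so the one-class-per-UDF constraint applies only to the registerJavaFunction path. Once registered on the JVM SparkSession that PySpark wraps, the names are callable from Spark SQL issued in Python, or through callUDF via Py4j.

    // MyUdfLibrary.java -- hypothetical example: many UDFs in one class
    package com.example.udfs;

    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.api.java.UDF2;
    import org.apache.spark.sql.types.DataTypes;

    public class MyUdfLibrary {

        // Each call to UDFRegistration.register binds a name to a lambda;
        // nothing limits a class to a single UDF on this path.
        public static void registerAll(SparkSession spark) {
            spark.udf().register("plus_one",
                    (UDF1<Long, Long>) x -> x + 1,
                    DataTypes.LongType);
            spark.udf().register("str_len",
                    (UDF1<String, Integer>) s ->
                            s == null ? null : Integer.valueOf(s.length()),
                    DataTypes.IntegerType);
            spark.udf().register("join_pair",
                    (UDF2<String, String, String>) (a, b) -> a + "_" + b,
                    DataTypes.StringType);
        }
    }

    // From PySpark, call into the JVM once through Py4j (these are private
    // attributes, so this is an internal-API sketch, not a stable contract):
    //   spark._jvm.com.example.udfs.MyUdfLibrary.registerAll(spark._jsparkSession)
    // After that, each name is usable from SQL issued in Python:
    //   spark.sql("SELECT plus_one(id), str_len('abc') FROM range(3)").show()
    // or can be wrapped with org.apache.spark.sql.functions.callUDF via Py4j.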
