Spark (2.3+) Java functions callable from PySpark/Python
Question
Re Spark Doc 2.3:

registerJavaFunction(name, javaClassName, returnType=None)[source]

Register a Java user-defined function as a SQL function.

In addition to a name and the function itself, the return type can be optionally specified. When the return type is not specified we would infer it via reflection.

Parameters:
- name – name of the user-defined function
- javaClassName – fully qualified name of java class
- returnType – the return type of the registered Java function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
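As a concrete illustration of the javaClassName argument: a Java UDF is a public class implementing one of Spark's org.apache.spark.sql.api.java.UDF1 through UDF22 interfaces (one interface per arity). A minimal sketch, with package and class names of my own choosing rather than anything from the docs:

```java
package com.example.udf;

import org.apache.spark.sql.api.java.UDF1;

// A single UDF packaged as one public class; its fully qualified name
// ("com.example.udf.ShoutUDF") is what registerJavaFunction takes.
public class ShoutUDF implements UDF1<String, String> {
    @Override
    public String call(String s) throws Exception {
        return s == null ? null : s.toUpperCase();
    }
}
```

From PySpark this would then be registered with `spark.udf.registerJavaFunction("shout", "com.example.udf.ShoutUDF", StringType())` and used as `spark.sql("SELECT shout('hello')")`.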
My question:
I want to have a library of a large number of UDFs for Spark 2.3+, all written in Java and all accessible from PySpark/Python.
Reading the documentation linked above, it appears that there is a one-to-one mapping between a Java class and a Java UDF function (callable from Spark-SQL in PySpark). So if I have, say, 10 Java UDF functions, then I need to create 10 public Java classes, each with one UDF, to make them callable from PySpark/SQL.
Is this correct?
Can I create one public Java class, place a number of different UDFs inside that one class, and make all of them callable from PySpark in Spark 2.3?
The post below does not provide any Java sample code to help with my question. It looks like it is all in Scala. I want it all in Java, please. Do I need to extend a class or implement an interface to do it in Java? Any links to sample Java code callable from PySpark-SQL would be appreciated.
Spark: How to map Python with Scala or Java User Defined Functions?
Answer
So if I have, say, 10 Java UDF functions, then I need to create 10 public Java classes, each with one UDF, to make them callable from PySpark/SQL.

Is this correct?
Yes, that's correct. However, you can:

- Use UserDefinedFunction and interface with it as shown in Spark: How to map Python with Scala or Java User Defined Functions?
- Use UDFRegistration.register to register named udfs, and then just call org.apache.spark.sql.functions.callUDF through Py4j for each registered function.
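The second option can be sketched in Java as follows. Note that this sidesteps the one-class-per-UDF constraint: a single helper class can register any number of named UDFs on a SparkSession. All package, class, and UDF names here are hypothetical, not from the original post:

```java
package com.example.udf;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public final class RegisterAllUdfs {
    // Registers several named UDFs on one SparkSession; nothing forces
    // one class per UDF when you register from the JVM side yourself.
    public static void register(SparkSession spark) {
        spark.udf().register("plus_one",
                (UDF1<Long, Long>) x -> x + 1, DataTypes.LongType);
        spark.udf().register("shout",
                (UDF1<String, String>) s -> s.toUpperCase(), DataTypes.StringType);
    }
}
```

Once register(spark) has been invoked in the JVM (for example from PySpark through Py4j, along the lines of `spark.sparkContext._jvm.com.example.udf.RegisterAllUdfs.register(spark._jsparkSession)`), each UDF is callable by name from Spark SQL in PySpark, e.g. `spark.sql("SELECT plus_one(id) FROM t")`, or via org.apache.spark.sql.functions.callUDF on the JVM side.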