What are Untyped Scala UDF and Typed Scala UDF? What are their differences?

Problem Description

I've been using Spark 2.4 for a while and just started switching to Spark 3.0 over the last few days. After switching to Spark 3.0, I got this error when running udf((x: Int) => x, IntegerType):

Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
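
For context, here is a minimal reproduction sketch of my own (assuming a local SparkSession; with default settings on Spark 3.0, the untyped line below is what throws the AnalysisException):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Untyped creation: an AnyRef closure plus an explicit DataType.
// On Spark 3.0 with default settings, this call throws the AnalysisException above.
val untyped = udf((x: Int) => x, IntegerType)

// Typed creation, as suggested by option 1 of the error message:
// drop the DataType argument and let Spark infer input and return types.
val typed = udf((x: Int) => x)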

The solutions are proposed by Spark itself, and after googling for a while I got to the Spark Migration Guide page:

In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

Source: Spark Migration Guide
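
To make the quoted behavior change concrete, here is a sketch of the guide's own example (assuming Spark 3.0 with spark.sql.legacy.allowUntypedScalaUDF=true so that the untyped API still works; the DataFrame here is hypothetical):

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// The untyped UDF from the guide's example
val f = udf((x: Int) => x, IntegerType)

// A nullable integer column "x" containing a null
val df = Seq[java.lang.Integer](1, null).toDF("x")

// Spark 2.4 and below: the null row yields null.
// Spark 3.0: the closure sees 0 (the Java default for Int), so the null row yields 0.
df.select(f(col("x"))).show()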

I notice that my usual way of using the functions.udf API, udf(AnyRef, DataType), is called an untyped Scala UDF, while the proposed solution, udf(AnyRef), is called a typed Scala UDF.

  • To my understanding, the first one looks more strictly typed than the second one: the first has its output type explicitly defined while the second does not, hence my confusion about why it's called untyped.
  • Also, the function passed to udf, (x: Int) => x, clearly has its input type defined, yet Spark claims You're using untyped Scala UDF, which does not have the input type information?

Is my understanding correct? Even after more intensive searching, I still can't find any material explaining what an untyped Scala UDF is and what a typed Scala UDF is.

So my questions are: what are they, and what are their differences?

Recommended Answer

In a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas in an untyped Scala UDF, the UDF doesn't know the types of the columns passed as arguments.

When creating a typed Scala UDF, the types of the columns passed as arguments and of the UDF's output are inferred from the function's argument and return types, whereas when creating an untyped Scala UDF, there is no type inference at all, either for the arguments or for the output.

What can be confusing is that when creating a typed UDF, the types are inferred from the function and not explicitly passed as arguments. To be more explicit, you can write the typed UDF creation as follows:

val my_typed_udf = udf[Int, Int]((x: Int) => x)
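
For instance, applying it is then just (a hypothetical usage sketch, assuming spark.implicits._ is in scope):

val df = Seq(1, 2, 3).toDF("x")

// No DataType needed: both the Int argument and the Int result were inferred above
df.select(my_typed_udf($"x")).show()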

Now, let's look at the two points you raised.

To my understanding, the first one (e.g. udf(AnyRef, DataType)) looks more strictly typed than the second one (e.g. udf(AnyRef)): the first has its output type explicitly defined while the second does not, hence my confusion about why it's called untyped.

According to the Spark functions scaladoc, the signatures of the udf functions that turn a function into a UDF are actually, for the first one:

def udf(f: AnyRef, dataType: DataType): UserDefinedFunction 

and for the second one:

def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction

So the second one is actually more typed than the first one: the second takes into account the type of the function passed as an argument, whereas the first erases it.

That's why, with the first one, you need to define the return type: Spark needs this information but can't infer it from the function passed as an argument, since its type has been erased. With the second one, the return type is inferred from the function passed as an argument.
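
To see outside of Spark why an AnyRef parameter loses this information, here is a small illustration of my own (not Spark code; it requires scala-reflect on the classpath):

import scala.reflect.runtime.universe._

// Capture whatever static type the compiler sees for the argument
def describe[T: TypeTag](value: T): String = typeOf[T].toString

describe((x: Int) => x)            // "Int => Int": the full function type is visible
describe(((x: Int) => x): AnyRef)  // "AnyRef": upcasting to AnyRef discards it

This mirrors the two signatures above: udf[RT: TypeTag, A1: TypeTag] receives TypeTags for both Int types, while udf(f: AnyRef, dataType: DataType) only ever sees AnyRef.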

Also, the function passed to udf, (x: Int) => x, clearly has its input type defined, yet Spark claims You're using untyped Scala UDF, which does not have the input type information?

What is important here is not the function itself, but how Spark creates a UDF from it.

In both cases, the function to be turned into a UDF has its input and return types defined, but those types are erased and not taken into account when the UDF is created with udf(AnyRef, DataType).
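
The practical consequence, following the error message's own description (a sketch; I'm assuming Spark 3.0 null-handling behavior here):

import org.apache.spark.sql.functions.udf

// Typed creation: Spark knows the argument is a primitive Int, which cannot hold null,
// so on Spark 3.0 a null input yields null instead of being blindly passed to the closure.
val typedIdentity = udf((x: Int) => x)

// Untyped creation (with the legacy flag enabled), by contrast, would let the closure
// see 0, the Java default value for Int, whenever the input is null.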
