UDF in Spark SQL DSL
Question
I am trying to use the DSL over pure SQL in Spark SQL jobs, but I cannot get my UDF to work.
sqlContext.udf.register("subdate",(dateTime: Long)=>dateTime.toString.dropRight(6))
This does not work:
rdd1.toDF.join(rdd2.toDF).where("subdate(rdd1(date_time)) === subdate(rdd2(dateTime))")
I would also like to add another join condition, as in this working pure SQL:
val results=sqlContext.sql("select * from rdd1 join rdd2 on rdd1.id=rdd2.id and subdate(rdd1.date_time)=subdate(rdd2.dateTime)")
Thanks for your help.
Recommended answer
The SQL expression you pass to the where method is incorrect, for at least a couple of reasons:
- === is a Column method, not a valid SQL equality operator. You should use a single equality sign: =
- Bracket notation (table(column)) is not a valid way to reference a column in SQL; in this context it is parsed as a function call. SQL uses dot notation (table.column). And even if it were valid, neither rdd1 nor rdd2 is a valid table alias here.
Since it looks like the column names are unambiguous, you can simply use the following code:
df1.join(df2).where("subdate(date_time) = subdate(dateTime)")
If that were not the case, dot syntax would not work without providing aliases first. See for example Usage of spark DataFrame "as" method.
Moreover, registering UDFs makes sense mostly when you use raw SQL all the way. If you want to use the DataFrame API, it is better to use a UDF directly:
import org.apache.spark.sql.functions.udf
val subdate = udf((dateTime: Long) => dateTime.toString.dropRight(6))
val df1 = rdd1.toDF
val df2 = rdd2.toDF
df1.join(df2, subdate($"date_time") === subdate($"dateTime"))
Or, if the column names are ambiguous:
df1.join(df2, subdate(df1("date_time")) === subdate(df2("date_time")))
Finally, for simple functions like this one, it is better to compose built-in expressions than to create UDFs.
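To illustrate that last point for this particular UDF: dropping the last six decimal digits of a non-negative Long is equivalent to integer division by 1,000,000, so the truncation can be expressed with plain column arithmetic. A minimal sketch of the equivalence in plain Scala (assuming date_time holds non-negative values with more than six digits, e.g. a microsecond epoch timestamp):

```scala
// The UDF's logic: drop the last 6 decimal digits as a string.
def subdateUdf(dateTime: Long): String = dateTime.toString.dropRight(6)

// The same truncation via integer division, which maps directly
// onto built-in column arithmetic and stays visible to the optimizer.
def subdateBuiltin(dateTime: Long): Long = dateTime / 1000000L

val micros = 1499262000123456L
assert(subdateUdf(micros) == subdateBuiltin(micros).toString)
```

With that equivalence, the join condition could be written without a UDF as something like df1.join(df2, (df1("date_time") / 1000000).cast("long") === (df2("dateTime") / 1000000).cast("long")) — the cast is needed because column division produces a double, and casting to long truncates it for non-negative values. Unlike an opaque UDF, such an expression can be analyzed and optimized by Catalyst.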