UDF in Spark SQL DSL


Problem description

I am trying to use the DSL over pure SQL in Spark SQL jobs, but I cannot get my UDF to work.

sqlContext.udf.register("subdate", (dateTime: Long) => dateTime.toString.dropRight(6))

This does not work:

rdd1.toDF.join(rdd2.toDF).where("subdate(rdd1(date_time)) === subdate(rdd2(dateTime))")

I would also like to add another join condition, as in this working pure SQL:

val results = sqlContext.sql("select * from rdd1 join rdd2 on rdd1.id = rdd2.id and subdate(rdd1.date_time) = subdate(rdd2.dateTime)")

Thanks for your help.

Answer

The SQL expression you pass to the where method is incorrect for at least a few reasons:

  • === is a Column method, not a valid SQL equality operator. You should use a single equals sign =
  • bracket notation (table(column)) is not a valid way to reference columns in SQL. In this context it will be recognized as a function call. SQL uses dot notation (table.column)
  • even if it were, neither rdd1 nor rdd2 is a valid table alias

Since it looks like the column names are unambiguous, you can simply use the following code:

df1.join(df2).where("subdate(date_time) = subdate(dateTime)")

If that were not the case, dot syntax would not work without providing aliases first. See for example Usage of spark DataFrame "as" method.

Moreover, registering UDFs makes sense mostly when you use raw SQL all the way. If you want to use the DataFrame API, it is better to use a UDF directly:

import org.apache.spark.sql.functions.udf

val subdate = udf((dateTime: Long) => dateTime.toString.dropRight(6)) 

val df1 = rdd1.toDF
val df2 = rdd2.toDF

df1.join(df2, subdate($"date_time") === subdate($"dateTime"))

or, if the column names are ambiguous:

df1.join(df2, subdate(df1("date_time")) === subdate(df2("date_time")))

Finally, for simple functions like this, it is better to compose built-in expressions than to create UDFs.
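As a sketch of that last suggestion (assuming, as the UDF above does, that the column holds a numeric value whose last six digits should be dropped; the helper name subdateExpr is hypothetical), the same transformation can be written with built-in column functions, which lets the optimizer see into the expression instead of treating it as an opaque UDF call:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{length, lit}

// Drop the last 6 characters of the stringified value using only
// built-in expressions; equivalent to dateTime.toString.dropRight(6).
def subdateExpr(c: Column): Column = {
  val s = c.cast("string")
  s.substr(lit(1), length(s) - 6)
}

df1.join(df2, subdateExpr(df1("date_time")) === subdateExpr(df2("dateTime")))
```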

