UDF in Spark SQL DSL


Problem description

I am trying to use the DSL over pure SQL in Spark SQL jobs, but I cannot get my UDF to work.

sqlContext.udf.register("subdate", (dateTime: Long) => dateTime.toString.dropRight(6))

This does not work:

rdd1.toDF.join(rdd2.toDF).where("subdate(rdd1(date_time)) === subdate(rdd2(dateTime))")

I would also like to add another join condition, as in this working pure SQL:

val results = sqlContext.sql("select * from rdd1 join rdd2 on rdd1.id = rdd2.id and subdate(rdd1.date_time) = subdate(rdd2.dateTime)")

Thanks for your help.

Answer

The SQL expression you pass to the where method is incorrect for at least a few reasons:

  • === is a Column method, not a valid SQL equality operator. You should use a single equals sign =
  • bracket notation (table(column)) is not a valid way to reference columns in SQL. In this context it will be recognized as a function call. SQL uses dot notation (table.column)
  • even if it were, neither rdd1 nor rdd2 is a valid table alias

Since it looks like the column names are unambiguous, you can simply use the following code:

df1.join(df2).where("subdate(date_time) = subdate(dateTime)")

If that were not the case, dot syntax would not work without providing aliases first. See for example Usage of spark DataFrame "as" method.

Moreover, registering UDFs makes sense mostly when you use raw SQL all the way. If you want to use the DataFrame API, it is better to use a UDF directly:

import org.apache.spark.sql.functions.udf

val subdate = udf((dateTime: Long) => dateTime.toString.dropRight(6)) 

val df1 = rdd1.toDF
val df2 = rdd2.toDF

df1.join(df2, subdate($"date_time") === subdate($"dateTime"))

or, if the column names are ambiguous:

df1.join(df2, subdate(df1("date_time")) === subdate(df2("date_time")))

Finally, for simple functions like this, it is better to compose built-in expressions than to create UDFs.
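As a sketch of that last suggestion (assuming, as the UDF above does, that the column holds a numeric value whose last six digits should be dropped; the helper name subdateExpr is hypothetical), the same transformation can be written with built-in column functions, which lets the optimizer see into the expression instead of treating it as an opaque UDF call:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{length, lit}

// Drop the last 6 characters of the stringified value using only
// built-in expressions; equivalent to dateTime.toString.dropRight(6).
def subdateExpr(c: Column): Column = {
  val s = c.cast("string")
  s.substr(lit(1), length(s) - 6)
}

df1.join(df2, subdateExpr(df1("date_time")) === subdateExpr(df2("dateTime")))
```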

