Creating UDF function with Non-Primitive Data Type and using in Spark-sql Query: Scala


Problem Description


I am creating a function in Scala that I want to use in my spark-sql query. My query works fine in Hive, and also if I give the same query in Spark SQL, but I use the same query in multiple places, so I want to create it as a reusable function/method that I can call whenever it is required. I have created the function below in my Scala class.

def date_part(date_column:Column) = {
    val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) // gives value as 01, 02, ... etc

    m1 match {
        case 01 => concat(concat(year(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM- yyyy"))))-1,'-'),substr(year(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))),3,4))
        //etc..
        case _ => "some other logic"
    }
}

But it is showing multiple errors.

  1. For 01:

  - Decimal integer literals may not have a leading zero. (Octal syntax is obsolete.)
  - type mismatch; found : Int(0) required: org.apache.spark.sql.Column.

  2. For '-':

  - type mismatch; found : Char('-') required: org.apache.spark.sql.Column.

  3. For 'substr':

  - not found: value substr.

Also, if I create any simple function with the type as Column, I am not able to register it; I get an error that it is not possible in columnar format. For all primitive data types (String, Long, Int) it works fine, but in my case the type is Column, so I am not able to do this. Can someone please guide me on how I should do this? As of now I have found on Stack Overflow that I need to use this function with a df and then convert that df to a temp table. Can someone please guide me to any alternate way, so that without much change to my existing code I can use this functionality?
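For reference, the Column-typed function itself can be made to compile by staying entirely in Column expressions. The sketch below is only an assumption on my part, not an accepted answer: it replaces the Scala match with when/otherwise, wraps literals in lit(), and uses substring() from org.apache.spark.sql.functions (there is no top-level substr); the "some other logic" branch is a placeholder carried over from the question.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def date_part(date_column: Column): Column = {
  val d  = to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))
  val m1 = month(d) // Column of Int: 1, 2, ... 12

  // A Column cannot be pattern-matched on its value, so branch with when/otherwise.
  when(m1 === lit(1),
    concat((year(d) - 1).cast("string"), lit("-"),
           substring(year(d).cast("string"), 3, 2)))
    .otherwise(lit("some other logic"))
}

A function like this needs no registration at all: because it takes and returns Column, it can be called directly in DataFrame code, e.g. df.withColumn("part", date_part(col("my_date"))), though not from a SQL string.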

Solution

Try the code below.

scala> import org.joda.time.format._
import org.joda.time.format._

scala> spark.udf.register("datePart",(date:String) => DateTimeFormat.forPattern("MM-dd-yyyy").parseDateTime(date).toString(DateTimeFormat.forPattern("MMyyyy")))
res102: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> spark.sql("""select datePart("03-01-2019") as datepart""").show
+--------+
|datepart|
+--------+
|  032019|
+--------+
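To reuse this in existing SQL without rewriting it against a DataFrame, the registered UDF can be called directly once the data is exposed as a temp view, which matches the temp-table approach mentioned in the question. A hypothetical usage, where df, orders, and the String column order_date are stand-ins for your own names:

// Hypothetical: `df`, `orders`, and `order_date` are placeholder names.
df.createOrReplaceTempView("orders")
spark.sql("select order_date, datePart(order_date) as datepart from orders").show()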
