How to use a constant value in a UDF of Spark SQL (DataFrame)
Problem description
I have a dataframe that includes a timestamp column. To aggregate by time (minute, hour, or day), I have tried:
import org.apache.spark.sql.functions.udf

val toSegment = udf((timestamp: String) => {
  val asLong = timestamp.toLong
  asLong - asLong % 3600000 // period = 1 hour
})

val df: DataFrame // the dataframe
df.groupBy(toSegment($"timestamp")).count()
This works fine.
My question is how to generalize the UDF toSegment as:
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})
I have tried the following, but it doesn't work:
df.groupBy(toSegment($"timestamp", $"3600000")).count()
It seems to look for a column named 3600000. A possible solution is to use a constant column, but I couldn't find how to create one.
Recommended answer
You can use org.apache.spark.sql.functions.lit() to create the constant column:
import org.apache.spark.sql.functions._
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count()
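For reference, a minimal self-contained sketch of the whole flow (the sample epoch-millisecond values and the local SparkSession setup are assumptions for illustration, not part of the original question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, lit}

val spark = SparkSession.builder().master("local[*]").appName("segment-example").getOrCreate()
import spark.implicits._

// Hypothetical sample data: epoch timestamps in milliseconds, stored as strings
val df = Seq("1419027583000", "1419029583000", "1419031583000").toDF("timestamp")

// Generalized UDF: truncate a timestamp to the start of its period (period in milliseconds)
val toSegmentGeneralized = udf((timestamp: String, period: Int) => {
  val asLong = timestamp.toLong
  asLong - asLong % period
})

// Pass the period as a constant column with lit(); $"3600000" would instead
// be interpreted as a reference to a column named 3600000
df.groupBy(toSegmentGeneralized($"timestamp", lit(3600000))).count().show()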