Spark partition data writing by timestamp
Question
I have some data with a timestamp column whose values are longs in standard epoch (Unix time) format, and I need to save that data in a split layout like yyyy/mm/dd/hh using Spark Scala.
data.write.partitionBy("timestamp").format("orc").save("mypath")
This just partitions the data by the raw timestamp value, like below:
timestamp=1458444061098
timestamp=1458444061198
But I want it to be like:
└── YYYY
└── MM
└── DD
└── HH
Answer
You can leverage various Spark SQL date/time functions for this. First, add a new date column created from the Unix timestamp column.
val withDateCol = data
  .withColumn("date_col", from_unixtime(col("timestamp") / 1000))
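For reference, the same epoch-millis-to-components conversion can be sketched in plain Scala with `java.time` (no Spark needed), using the sample timestamp from the question; the division by 1000 above exists because `from_unixtime` expects seconds, not milliseconds:

```scala
import java.time.{Instant, ZoneOffset}

// Epoch value in milliseconds, as in the question's sample data
val epochMillis = 1458444061098L

// java.time works with millis directly; Spark's from_unixtime needs seconds
val dt = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC)

val year  = dt.getYear
val month = dt.getMonthValue
val day   = dt.getDayOfMonth
val hour  = dt.getHour

// The target directory layout for this row would then be year/month/day/hour
val layout = f"$year%04d/$month%02d/$day%02d/$hour%02d"
```

Note this sketch fixes the zone to UTC; Spark's `from_unixtime` uses the session time zone, so the derived hour (and potentially day) can differ unless the session is also set to UTC.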
After this, you can add year, month, day and hour columns to the DataFrame and then partition by those new columns for the write.
withDateCol
.withColumn("year", year(col("date_col")))
.withColumn("month", month(col("date_col")))
.withColumn("day", dayofmonth(col("date_col")))
.withColumn("hour", hour(col("date_col")))
.drop("date_col")
.write
.partitionBy("year", "month", "day", "hour")
.format("orc")
.save("mypath")
The columns included in the partitionBy clause won't be part of the file schema.
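That is because Spark encodes the partition values into the directory names as `key=value` segments and reconstructs the columns from the path on read. A minimal sketch of that encoding, with a hypothetical `partitionPath` helper (not a Spark API):

```scala
// Hypothetical helper mimicking how Spark lays out partition directories
// under the output path instead of storing the columns in the ORC files
def partitionPath(year: Int, month: Int, day: Int, hour: Int): String =
  s"mypath/year=$year/month=$month/day=$day/hour=$hour"

val path = partitionPath(2016, 3, 20, 3)

// On read, Spark recovers the column values by parsing key=value segments
val cols: Map[String, String] = path.split("/").drop(1).map { seg =>
  val Array(k, v) = seg.split("=")
  k -> v
}.toMap
```

So a `spark.read.orc("mypath")` over this layout would surface `year`, `month`, `day` and `hour` as regular columns again, even though no ORC file contains them.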