spark partition data writing by timestamp
Question
I have some data with a timestamp column, a long holding a standard epoch value, and I need to save that data in a split format like yyyy/mm/dd/hh using Spark Scala:
data.write.partitionBy("timestamp").format("orc").save("mypath")
This just splits the data by the raw timestamp, like below:
timestamp=1458444061098
timestamp=1458444061198
But I want it like:
└── YYYY
└── MM
└── DD
└── HH
Answer
You can leverage the various Spark SQL date/time functions for this. First, add a new date-type column created from the unix timestamp column.
import org.apache.spark.sql.functions._

val withDateCol = data
  // from_unixtime expects seconds, so divide the epoch-milliseconds column by 1000
  .withColumn("date_col", from_unixtime(col("timestamp") / 1000, "yyyy-MM-dd HH:mm:ss"))
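The seconds-versus-milliseconds distinction matters here: `from_unixtime` takes seconds, while epoch values like those in the question are milliseconds. As a sanity check independent of Spark, plain `java.time` shows what one of the question's sample values resolves to (interpreting it in UTC is an assumption):

```scala
import java.time.{Instant, ZoneOffset}

// One of the sample epoch-millisecond timestamps from the question
val ts = 1458444061098L

// Interpret it as milliseconds since the epoch, in UTC
val dt = Instant.ofEpochMilli(ts).atZone(ZoneOffset.UTC)

println(s"${dt.getYear}/${dt.getMonthValue}/${dt.getDayOfMonth}/${dt.getHour}")
// prints 2016/3/20/3
```

If the same value were mistakenly treated as seconds, the resulting date would land tens of thousands of years in the future, which is usually the first symptom of a missing division by 1000.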
After this, you can add year, month, day, and hour columns to the DataFrame, then partition by these new columns for the write.
withDateCol
  .withColumn("year", year(col("date_col")))
  .withColumn("month", month(col("date_col")))
  .withColumn("day", dayofmonth(col("date_col")))
  .withColumn("hour", hour(col("date_col")))
  .drop("date_col")
  .write // partitionBy is a method on DataFrameWriter, not on DataFrame
  .partitionBy("year", "month", "day", "hour")
  .format("orc")
  .save("mypath")
The columns included in the partitionBy clause won't be part of the file schema.
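Instead, the partition columns are encoded into the directory names using Hive-style `name=value` segments, and Spark restores them as columns when the data is read back. A minimal sketch of the path Spark would produce for the sample timestamp above (plain Scala, no Spark needed; the `mypath` base comes from the question's example and UTC interpretation is an assumption):

```scala
import java.time.{Instant, ZoneOffset}

// Build the Hive-style partition path Spark would write for one record
val dt = Instant.ofEpochMilli(1458444061098L).atZone(ZoneOffset.UTC)
val partitionPath =
  s"mypath/year=${dt.getYear}/month=${dt.getMonthValue}/day=${dt.getDayOfMonth}/hour=${dt.getHour}"

println(partitionPath)
// prints mypath/year=2016/month=3/day=20/hour=3
```

When the ORC files under `mypath` are later read with `spark.read.orc("mypath")`, Spark discovers these directories and surfaces `year`, `month`, `day`, and `hour` as columns again, even though the values are absent from the files themselves.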