Format TimestampType in Spark DataFrame - Scala
Question
When I try to cast a string field to TimestampType in a Spark DataFrame, the output value comes with sub-second precision (yyyy-MM-dd HH:mm:ss.S). But I need the format to be yyyy-MM-dd HH:mm:ss, i.e., excluding the sub-second part. Also, I want to save this as a timestamp field while writing into a Parquet file, so the datatype of my field should be a timestamp of format yyyy-MM-dd HH:mm:ss.
I tried using TimestampType as
col("column_A").cast(TimestampType)
or
col("column_A").cast("timestamp")
to cast the field to timestamp. Both cast the field to timestamp, but with the sub-second precision.
Can anyone help in saving the timestamp datatype to a Parquet file with the required format specification?
EDIT
Input:
val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).show(false)
+----+---------------------+-------------------+
|cola|colb |datetime |
+----+---------------------+-------------------+
|a |2017-01-01 12:02:00.0|2017-01-01 12:02:00|
|b |2017-02-01 11:22:30 |2017-02-01 11:22:30|
+----+---------------------+-------------------+
scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).printSchema
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
|-- datetime: string (nullable = true)
In the above, we are getting the right timestamp format, but when we print the schema, the datetime field is of type string, whereas I need a timestamp type here.
Now, if I attempt to cast the field to timestamp, the format is set back to sub-second precision, which is not intended.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
a: org.apache.spark.sql.DataFrame = [cola: string, colb: string]
scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).show(false)
+----+---------------------+---------------------+
|cola|colb |datetime |
+----+---------------------+---------------------+
|a |2017-01-01 12:02:00.0|2017-01-01 12:02:00.0|
|b |2017-02-01 11:22:30 |2017-02-01 11:22:30.0|
+----+---------------------+---------------------+
scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).printSchema
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
|-- datetime: timestamp (nullable = true)
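The trailing `.0` reappears because TimestampType has no display format of its own: it stores the instant (internally with microsecond resolution), and the `.0` is just how Spark renders it. `date_format` produces a plain string, and casting that string back to TimestampType restores the default rendering. The truncation step itself is ordinary date parsing and formatting, which can be sketched outside Spark with `java.time` (helper name `toSecondPrecision` is illustrative, not a Spark API):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Parser that tolerates an optional fractional-second suffix, similar to
// what Spark's string-to-timestamp cast accepts.
val parser    = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.S]")
// Formatter without any fractional-second field, like date_format's pattern.
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

def toSecondPrecision(s: String): String =
  LocalDateTime.parse(s, parser).format(formatter)

println(toSecondPrecision("2017-01-01 12:02:00.0")) // 2017-01-01 12:02:00
println(toSecondPrecision("2017-02-01 11:22:30"))   // 2017-02-01 11:22:30
```

In other words, "format" is a property of the string representation, not of the timestamp value itself.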
What I am expecting is for the format to be yyyy-MM-dd HH:mm:ss and the datatype of the field to be timestamp.
Thanks in advance
Accepted answer
You can use unix_timestamp to convert the string date time to timestamp.
unix_timestamp(Column s, String p)
Converts a time string with the given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to a Unix timestamp (in seconds); returns null if parsing fails.
val format = "yyyy-MM-dd HH:mm:ss"
dataframe.withColumn("column_A", unix_timestamp($"date", format))
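Because Unix time is whole seconds, this conversion drops the fractional part; casting the result back to TimestampType then yields a timestamp column whose values have a zero fraction. A minimal sketch of the full pipeline, using the column names from the example in the question (untested against a live Spark session; the output path is illustrative):

```scala
import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.TimestampType

val format = "yyyy-MM-dd HH:mm:ss"

// unix_timestamp parses the string and returns whole seconds (LongType),
// discarding any fractional part; the cast turns it back into a timestamp.
val result = a.withColumn(
  "datetime",
  unix_timestamp(col("colb"), format).cast(TimestampType)
)

result.write.parquet("/tmp/output")
```

Note that Parquet's timestamp type still carries sub-second resolution; this approach only guarantees that the stored values have a zero fractional part, which is what the schema-plus-truncation requirement amounts to.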
Hope this helps!