Format TimestampType in Spark DataFrame - Scala
Question
When I try to cast a string field to a TimestampType in a Spark DataFrame, the output value comes with microsecond precision (yyyy-MM-dd HH:mm:ss.S). But I need the format to be yyyy-MM-dd HH:mm:ss, i.e., excluding the microsecond precision. Also, I want to save this as a timestamp field while writing into a Parquet file, so the datatype of my field should be a timestamp in the format yyyy-MM-dd HH:mm:ss.
I tried using TimestampType as
col("column_A").cast(TimestampType)
or
col("column_A").cast("timestamp")
to cast the field to timestamp. These are able to cast the field to timestamp, but with the microsecond precision.
Can anyone help in saving the timestamp datatype to a Parquet file with the required format specification?
EDIT
Input:
val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).show(false)
+----+---------------------+-------------------+
|cola|colb |datetime |
+----+---------------------+-------------------+
|a |2017-01-01 12:02:00.0|2017-01-01 12:02:00|
|b |2017-02-01 11:22:30 |2017-02-01 11:22:30|
+----+---------------------+-------------------+
scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).printSchema
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
|-- datetime: string (nullable = true)
In the above, we are getting the right timestamp format, but when we print the schema, the datetime field is of type String, whereas I need a timestamp type here.
Now, if I attempt to cast the field to timestamp, the format is set to microsecond precision, which is not intended.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
a: org.apache.spark.sql.DataFrame = [cola: string, colb: string]
scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).show(false)
+----+---------------------+---------------------+
|cola|colb |datetime |
+----+---------------------+---------------------+
|a |2017-01-01 12:02:00.0|2017-01-01 12:02:00.0|
|b |2017-02-01 11:22:30 |2017-02-01 11:22:30.0|
+----+---------------------+---------------------+
scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).printSchema
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
|-- datetime: timestamp (nullable = true)
What I am expecting is for the format to be yyyy-MM-dd HH:mm:ss, and also for the datatype of the field to be timestamp.
Thanks in advance.
Answer
I think what you are missing is that timestamp/datetime fields do NOT have a readable format in native storage. The storage format is a float, or INT96, or similar, depending on the database. Formatting a datetime/timestamp for readability has always been a reporting concern (i.e., performed by the tool preparing the data for display), which is why you noticed that when you supplied a string format for the date, it was correctly converted and stored as a string. The database (Spark) stores only exactly what it needs in order to know precisely what the time value is.
You can specify that a timestamp value has no milliseconds, i.e., a millisecond value of 0, but not that it should not display milliseconds.
This would be akin to specifying rounding behavior on a numeric column (also a reporting concern).
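To make this concrete, here is a minimal sketch of the approach described above: cast to timestamp, zero out the sub-second part, and apply the display format only when rendering. It assumes Spark ≥ 2.3 (for `date_trunc`); the output path `/tmp/ts_demo` and the object name are illustrative, not from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TimestampTruncSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timestamp-trunc-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))
      .toDF("cola", "colb")

    // Cast to timestamp, then truncate to whole seconds so the stored
    // value has a zero sub-second component. show() will still render
    // a trailing ".0" -- that is display behavior, not stored format.
    val out = df.withColumn(
      "datetime",
      date_trunc("second", col("colb").cast("timestamp"))
    )

    out.printSchema() // datetime is of type timestamp

    // Parquet stores the timestamp in its binary representation;
    // there is no per-column display format to persist.
    out.write.mode("overwrite").parquet("/tmp/ts_demo")

    // Formatting happens only at reporting time, e.g.:
    out.select(date_format(col("datetime"), "yyyy-MM-dd HH:mm:ss")).show(false)

    spark.stop()
  }
}
```

On Spark versions before 2.3, the question's own `date_format(...).cast("timestamp")` round trip achieves the same zeroed sub-second value; the `.0` shown by `show()` is cosmetic either way.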