How to change the column type from String to Date in DataFrames?
Question
I have a dataframe with two columns (C, D) defined as string column types, but the data in these columns are actually dates. For example, column C has the date "01-APR-2015" and column D has "20150401". I want to change these to date column types, but I haven't found a good way of doing that. Looking at Stack Overflow, I need to convert the string column type to a Date column type in Spark SQL's DataFrame. The date format can be "01-APR-2015"; I looked at this post, but it didn't have info related to dates.
Answer
Spark >= 2.2
You can use to_date:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))
or to_timestamp:
df.select(to_timestamp($"ts", "dd-MMM-yyyy").alias("timestamp"))
without an intermediate unix_timestamp call.
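The format strings here follow Java's date-pattern conventions: "dd-MMM-yyyy" for column C, and "yyyyMMdd" would cover column D from the question. As a quick sanity check of the patterns themselves (not of Spark), here is a plain-Python sketch of the same parsing, using the strptime codes %d-%b-%Y and %Y%m%d, which are the Python equivalents of those Java patterns:

```python
from datetime import datetime

# Column C style: "01-APR-2015" -> dd-MMM-yyyy in Spark, %d-%b-%Y in Python.
# CPython's strptime matches month names case-insensitively, so "APR" parses fine.
c_value = datetime.strptime("01-APR-2015", "%d-%b-%Y").date()

# Column D style: "20150401" -> yyyyMMdd in Spark, %Y%m%d in Python.
d_value = datetime.strptime("20150401", "%Y%m%d").date()

print(c_value, d_value)  # both parse to 2015-04-01
```

In Spark itself the column-D case would be the same call with the other pattern, e.g. to_date($"d", "yyyyMMdd").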
Spark < 2.2
Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long, cast it to a timestamp, and truncate it with to_date:
import org.apache.spark.sql.functions.{unix_timestamp, to_date}

val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")

df.select(to_date(unix_timestamp(
  $"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("timestamp"))
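The parse → cast → truncate chain can be mirrored step by step in plain Python, as a rough sketch for intuition only: unix_timestamp yields epoch seconds, the cast turns them into a timestamp, and to_date drops the time-of-day part.

```python
from datetime import datetime, timezone

# Step 1: parse the string to epoch seconds (what unix_timestamp returns).
parsed = datetime.strptime("01-APR-2015", "%d-%b-%Y").replace(tzinfo=timezone.utc)
epoch_seconds = int(parsed.timestamp())

# Step 2: cast the long back to a timestamp.
ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Step 3: truncate to a date (what to_date does).
print(ts.date())  # 2015-04-01
```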
Note:
Depending on the Spark version, this may require some adjustments due to SPARK-11724:
Casting from integer types to timestamp treats the source int as being in millis. Casting from timestamp to integer types creates the result in seconds.
If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
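To see why the factor of 1000 matters: 2015-04-01 is 1427846400 seconds after the epoch, and a cast that reads that number as milliseconds lands in January 1970. A small plain-Python illustration (the numbers are derived from the epoch here, not taken from Spark output):

```python
from datetime import datetime, timezone

epoch_seconds = int(datetime(2015, 4, 1, tzinfo=timezone.utc).timestamp())
print(epoch_seconds)  # 1427846400

# Interpreted correctly as seconds:
print(datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date())  # 2015-04-01

# Interpreted as milliseconds, i.e. a cast that treats the int as millis
# without the *1000 correction (N millis == N / 1000 seconds):
print(datetime.fromtimestamp(epoch_seconds / 1000, tz=timezone.utc).date())  # 1970-01-17
```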