Combining two columns, casting to timestamp, and selecting from a df causes no error, but casting one column to timestamp and selecting causes an error


Problem description

When I select a column that has been converted with unix_timestamp and then cast to a timestamp, and then select from the resulting dataframe, I get a Spark AnalysisException. See the link below.

However, when I combine two columns, convert the combination with unix_timestamp, cast it to a timestamp type, and then select it from a df, there is no error.

Error case: How to extract the year from a date string?

No-error case:

import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val spark: SparkSession = SparkSession.builder()
  .appName("myapp").master("local").getOrCreate()

case class Person(id: Int, date: String, time: String)
import spark.implicits._

val mydf: DataFrame = Seq(Person(1, "9/16/13", "11:11:11")).toDF()

// Column modification: build one expression from the date and time columns.
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
// Note: the pattern "MM/dd/yy" covers only the date part of the concatenated string.
val newcol: Column = unix_timestamp(concat(datecol, lit(" "), timecol), "MM/dd/yy").cast(TimestampType)

mydf.select(newcol).show()

Results

Expected: a Spark AnalysisException, because mydf has no column named unix_timestamp(concat(....))

Actual:

+------------------------------------------------------------------+
|CAST(unix_timestamp(concat(date,  , time), MM/dd/yy) AS TIMESTAMP)|
+------------------------------------------------------------------+
|                                              2013-09-16 00:00:...|
+------------------------------------------------------------------+

Recommended answer

These are not really different cases. In the failing case, the second select ran against a new dataframe whose column name had changed. See below:

val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()

Here, select_df no longer has a column named date. Its only column has a generated name along the lines of CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP), so year($"date") cannot be resolved against it.
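One way to keep a second select working is to give the derived column an explicit name. The following is a minimal sketch rather than the original poster's code: the alias "date", the app name, the tuple-based sample row, and the pattern "M/d/yy" are assumptions made for illustration. The pattern differs from the question's "MM/dd/yy" because stricter parsers in newer Spark versions may reject a single-digit month under "MM".

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{unix_timestamp, year}
import org.apache.spark.sql.types.TimestampType

val spark: SparkSession = SparkSession.builder()
  .appName("alias-sketch").master("local").getOrCreate()
import spark.implicits._

// Same sample values as in the question, built from a tuple for brevity.
val mydf: DataFrame = Seq((1, "9/16/13", "11:11:11")).toDF("id", "date", "time")

// Without an alias, the projected column gets a generated name, not "date".
val unnamed: DataFrame =
  mydf.select(unix_timestamp(mydf("date"), "M/d/yy").cast(TimestampType))
println(unnamed.columns.mkString(", "))

// With an alias (assumed name "date"), later selects can refer to it by name.
val named: DataFrame =
  mydf.select(unix_timestamp(mydf("date"), "M/d/yy").cast(TimestampType).as("date"))
named.select(year($"date")).show()
```

Printing unnamed.columns is a quick way to see which names a later select can actually refer to.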

In the case shown in the question, by contrast, you are only defining a new column expression when you write:

val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)

You then use that expression to select from mydf, which still has the date and time columns it refers to, so it resolves and gives the expected result.
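The distinction can be sketched as follows: a Column is only an expression, and it is checked against whichever dataframe it is selected from. This is an illustrative sketch under assumptions, not code from the question: it uses col("...") instead of mydf("..."), a hypothetical "id"-only projection to provoke the failure, and the pattern "M/d/yy HH:mm:ss".

```scala
import scala.util.Try
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, unix_timestamp}
import org.apache.spark.sql.types.TimestampType

val spark: SparkSession = SparkSession.builder()
  .appName("column-resolution-sketch").master("local").getOrCreate()
import spark.implicits._

val mydf: DataFrame = Seq((1, "9/16/13", "11:11:11")).toDF("id", "date", "time")

// Building the expression touches no data and no particular dataframe.
val newcol: Column =
  unix_timestamp(concat(col("date"), lit(" "), col("time")), "M/d/yy HH:mm:ss")
    .cast(TimestampType)

// Resolves: mydf has both "date" and "time".
val ok: Try[DataFrame] = Try(mydf.select(newcol))

// Fails at analysis time: the projected dataframe only has "id".
val bad: Try[DataFrame] = Try(mydf.select("id").select(newcol))

println(s"select from mydf succeeded: ${ok.isSuccess}")
println(s"select from id-only df failed with AnalysisException: " +
  s"${bad.failed.toOption.exists(_.isInstanceOf[AnalysisException])}")
```

The failure appears when select is called, before any action such as show() runs, because Spark analyzes the plan as each dataframe is constructed.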

Hope this makes things clearer.
