pyspark to_timestamp does not include milliseconds


Problem description

I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152?

I have looked at the documentation and followed the SimpleDateFormat patterns, which the pyspark docs say the to_timestamp function uses.

Here is my DataFrame.

+--------------------------+
|updated_date              |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+

I used a millisecond pattern, without success, as shown below:

>>> df.select('updated_date').withColumn("updated_date_col2", 
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date              |updated_date_col2  |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+

I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152

Recommended answer

The reason is that pyspark's to_timestamp parses only down to seconds, even though TimestampType is able to hold milliseconds.

The following workaround may work:

If the timestamp pattern contains S, invoke a UDF to get a string of the form 'INTERVAL n MILLISECONDS' to use in the expression:

from pyspark.sql.functions import expr, to_timestamp

ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time down to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval (256 is a placeholder value)
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
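
The idea of parsing down to whole seconds and then adding the fractional part back as an interval can be illustrated in plain Python, outside Spark. This is only a sketch under stated assumptions: `add_millis` is a hypothetical helper that is not part of the original answer, it assumes the fraction follows a "." as in the sample data, and it uses Python's `datetime` rather than Spark's `to_timestamp`.

```python
from datetime import datetime, timedelta


def add_millis(raw: str) -> datetime:
    """Parse to whole seconds, then add the millisecond part back as an interval.

    Hypothetical helper for illustration only.
    """
    # Split "2019-01-04 11:09:21.152815" into the seconds part and the fraction.
    seconds_part, _, fraction = raw.partition(".")
    # Step 1: parse only down to seconds.
    ts = datetime.strptime(seconds_part, "%Y-%m-%d %H:%M:%S")
    # Step 2: keep the first three fractional digits (milliseconds); 0 if absent.
    millis = int(fraction[:3].ljust(3, "0")) if fraction else 0
    # Step 3: add the milliseconds back as an interval.
    return ts + timedelta(milliseconds=millis)


result = add_millis("2019-01-04 11:09:21.152815")
print(result.isoformat(sep=" ", timespec="milliseconds"))  # 2019-01-04 11:09:21.152
```

In Spark, the same two steps correspond to the to_timestamp call and the `+ expr("INTERVAL ... MILLISECONDS")` addition.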

To get the INTERVAL 256 MILLISECONDS string, we may use a Java UDF:

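# Pseudocode: getIntervalStringUDF is the Java UDF described below.
# Note that expr() takes a SQL string, so this line is a sketch, not runnable code.
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))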

Inside the UDF: getIntervalStringUDF(String timeString, String pattern)

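  1. Use SimpleDateFormat to parse the date according to the pattern.
  2. Return the formatted date as a string, using the pattern "'INTERVAL 'SSS' MILLISECONDS'".
  3. Return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions.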
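
The three steps above can be sketched in Python to show the intended logic. This is not the Java UDF itself: `get_interval_string` is a hypothetical name, and the default pattern uses Python `strptime` directives (with "." before the fraction, matching the sample data) rather than SimpleDateFormat letters.

```python
from datetime import datetime

# Fallback string returned when the input cannot be parsed (step 3).
FALLBACK = "INTERVAL 0 MILLISECONDS"


def get_interval_string(time_string, pattern="%Y-%m-%d %H:%M:%S.%f"):
    """Return 'INTERVAL <n> MILLISECONDS' for the fractional part of time_string.

    Hypothetical Python stand-in for the Java UDF described in the answer.
    """
    try:
        # Step 1: parse the date according to the pattern.
        parsed = datetime.strptime(time_string, pattern)
    except (ValueError, TypeError):
        # Step 3: malformed or missing input falls back to a zero interval.
        return FALLBACK
    # Step 2: format the millisecond part into the interval string.
    return f"INTERVAL {parsed.microsecond // 1000} MILLISECONDS"


print(get_interval_string("2019-01-04 11:09:21.152815"))  # INTERVAL 152 MILLISECONDS
print(get_interval_string("not a timestamp"))             # INTERVAL 0 MILLISECONDS
```

Unlike the SSS pattern letter, this sketch does not zero-pad the millisecond count.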
