Convert date string to timestamp for sorting on sub-second precision

Problem Description

I have a column date in a pySpark dataframe with dates in the following format:

2018-02-01T13:13:12.023507

I want to convert the dates in that column from string to timestamp (or something that I can sort it based on the date). So far I have tried the following:

new_df = df.withColumn(
    'date', 
    unix_timestamp("date", "YYYY-MM-DD'T'hh:mm:ss.s").cast("double").cast("timestamp")
)

And this:

new_df = df.withColumn(
    'date',
    from_unixtime(
        unix_timestamp(col(('date')), "yyyy-MM-dd'T'hh:mm:ss.SSS"), 
        "yyyy-MM-dd'T'HH:mm:ss.SSS"
    )
)

And this:

df_new = df.withColumn(
    'date1',
    F.to_timestamp("date", "yyyy-dd-MM'T'hh:mm:ss.s")
)

I tried everything I found in other similar questions but so far nothing is working. I have also tried a different format yyyy-MM-dd'T'HH:mm:ss.ssssss with no success. What am I missing?

Recommended Answer

Suppose you had the following DataFrame:

df = spark.createDataFrame(
    [('2018-02-01T13:13:12.023507', ), ('2018-02-01T13:13:12.323507', )], 
    ["date"]
)
df.show(truncate=False)
#+--------------------------+
#|date                      |
#+--------------------------+
#|2018-02-01T13:13:12.023507|
#|2018-02-01T13:13:12.323507|
#+--------------------------+

unix_timestamp only supports second precision. If you're only concerned with sorting based on the date, you can do the following:

from pyspark.sql.functions import col, unix_timestamp

df.withColumn(
    'new_date',
    # HH is the 24-hour-clock hour; any fractional seconds in the string are dropped by this pattern
    unix_timestamp(col('date'), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp")
).sort('new_date').show(truncate=False)
#+--------------------------+---------------------+
#|date                      |new_date             |
#+--------------------------+---------------------+
#|2018-02-01T13:13:12.323507|2018-02-01 13:13:12.0|
#|2018-02-01T13:13:12.023507|2018-02-01 13:13:12.0|
#+--------------------------+---------------------+

But since these two example rows have the same date and time up to the second, the sorting here will be indeterminate.
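
As a quick check (reusing the df and the same parsing pattern as above, so the same assumptions apply), both rows collapse to a single whole-second value, which is why the sort has nothing to break the tie with:

from pyspark.sql.functions import unix_timestamp
df.select(
    unix_timestamp('date', "yyyy-MM-dd'T'HH:mm:ss").alias('secs')
).distinct().count()
# returns 1: both rows parse to the same whole-second value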

If the sub-second portion is important to you, you can write your own function to handle that. One way is to split the date column on the "." and divide the fractional part by 1000000.0 to get the microseconds, then add this to the unix_timestamp value for sorting:

from pyspark.sql.functions import split

df.withColumn(
    'order_column',
    # whole seconds from unix_timestamp plus the fractional part of the string,
    # parsed out with split and scaled from microseconds to seconds
    unix_timestamp('date', "yyyy-MM-dd'T'HH:mm:ss") + split('date', r"\.")[1]/1000000.0
).sort("order_column").show(truncate=False)
#+--------------------------+-------------------+
#|date                      |order_column       |
#+--------------------------+-------------------+
#|2018-02-01T13:13:12.023507|1.517508792023507E9|
#|2018-02-01T13:13:12.323507|1.517508792323507E9|
#+--------------------------+-------------------+
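
As a side note, and purely as a sketch assuming the built-in string-to-timestamp cast accepts this 'T'-separated ISO-8601 format: TimestampType stores microseconds, so casting the string directly can give a sortable timestamp without any format pattern at all:

from pyspark.sql.functions import col

df.withColumn(
    'ts',
    # the ISO-8601 string is cast straight to TimestampType, keeping the
    # microsecond part for sorting (behaviour assumed, verify on your Spark version)
    col('date').cast('timestamp')
).sort('ts').show(truncate=False)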
