PySpark 1.5: How to Truncate a Timestamp from Seconds to the Nearest Minute


Problem Description



I am using PySpark. I have a column ('dt') in a dataframe ('canon_evt') that is a timestamp. I am trying to remove the seconds from a DateTime value. It is originally read in from parquet as a string. I then try to convert it to Timestamp via

canon_evt = canon_evt.withColumn('dt', to_date(canon_evt.dt))
canon_evt = canon_evt.withColumn('dt', canon_evt.dt.astype('Timestamp'))
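
Note that to_date truncates the value to a date, so the time-of-day is already lost (set to midnight) before the cast back to Timestamp; that is why the row further down shows 0, 0 for the time. A minimal sketch of a direct cast that keeps the time component, assuming the strings are in the default yyyy-MM-dd HH:mm:ss format:

# Cast the string column straight to a timestamp, preserving the
# time-of-day that to_date would drop.
canon_evt = canon_evt.withColumn('dt', canon_evt.dt.cast('timestamp'))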

Then I would like to remove the seconds. I tried 'trunc', 'date_format', and even concatenating the pieces together as below. I think it requires some sort of map and lambda combination, but I'm not certain whether Timestamp is an appropriate format, and whether it's possible to get rid of the seconds.

canon_evt = canon_evt.withColumn('dyt',year('dt') + '-' + month('dt') +
    '-' + dayofmonth('dt') + ' ' + hour('dt') + ':' + minute('dt'))

[Row(dt=datetime.datetime(2015, 9, 16, 0, 0), dyt=None)]
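
The None above is expected: + on Columns is numeric addition, so adding string literals like '-' to the integer outputs of year, month, and the rest produces null. If a string without seconds is acceptable (rather than a true timestamp), date_format can render it directly; a hedged sketch:

from pyspark.sql.functions import date_format

# Render the timestamp as a string with the seconds dropped.
# Note the result is a StringType column, not a Timestamp.
canon_evt = canon_evt.withColumn('dyt', date_format('dt', 'yyyy-MM-dd HH:mm'))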

Solution

Converting to Unix timestamps and basic arithmetic should do the trick:

from pyspark.sql import Row
from pyspark.sql.functions import col, unix_timestamp, round

df = sc.parallelize([
    Row(dt='1970-01-01 00:00:00'),
    Row(dt='2015-09-16 05:39:46'),
    Row(dt='2015-09-16 05:40:46'),
    Row(dt='2016-03-05 02:00:10'),
]).toDF()


## unix_timestamp converts string to Unix timestamp (bigint / long)
## in seconds. Divide by 60, round, multiply by 60 and cast
## should work just fine.
## 
dt_truncated = ((round(unix_timestamp(col("dt")) / 60) * 60)
    .cast("timestamp"))

df.withColumn("dt_truncated", dt_truncated).show(10, False)
## +-------------------+---------------------+
## |dt                 |dt_truncated         |
## +-------------------+---------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00.0|
## |2015-09-16 05:39:46|2015-09-16 05:40:00.0|
## |2015-09-16 05:40:46|2015-09-16 05:41:00.0|
## |2016-03-05 02:00:10|2016-03-05 02:00:00.0|
## +-------------------+---------------------+
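
Note that round snaps to the nearest minute (05:39:46 becomes 05:40:00). To truncate toward the start of the minute instead, as the question's title suggests, the same arithmetic works with floor; a sketch under the same assumptions:

from pyspark.sql.functions import col, floor, unix_timestamp

# floor drops the seconds entirely instead of rounding them.
dt_floored = ((floor(unix_timestamp(col("dt")) / 60) * 60)
    .cast("timestamp"))

df.withColumn("dt_floored", dt_floored).show(10, False)

On Spark 2.3 and later, date_trunc("minute", col("dt")) does the same thing in a single call, but it is not available in 1.5.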
