Change the timestamp to UTC format in Pyspark


Question

I have an input dataframe (ip_df); the data in this dataframe looks like this:

id            timestamp_value
1       2017-08-01T14:30:00+05:30
2       2017-08-01T14:30:00+06:30
3       2017-08-01T14:30:00+07:30

I need to create a new dataframe (op_df) in which the timestamp values are converted to UTC. The final output dataframe will look like this:

id            timestamp_value
1       2017-08-01T09:00:00+00:00
2       2017-08-01T08:00:00+00:00
3       2017-08-01T07:00:00+00:00

I want to achieve this using PySpark. Can someone please help me with it? Any help will be appreciated.

Answer

If you absolutely need the timestamp to be formatted exactly as indicated, namely with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
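As a rough sketch of that UDF approach (my own illustration, not code from the original answer; it assumes Python 3.7+, where `%z` in `strptime` accepts an offset written with a colon, such as "+05:30"):

```python
from datetime import datetime, timezone

def to_utc_string(ts):
    # Parse the ISO-8601 string, offset included (Python 3.7+ %z handles "+05:30")
    dt = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z')
    # Shift to UTC; isoformat() renders the offset in the desired "+00:00" form
    return dt.astimezone(timezone.utc).isoformat()
```

You would then register it, e.g. with `udf(to_utc_string, StringType())` from `pyspark.sql.functions`, and apply it via `ip_df.withColumn('timestamp_value', ...)`.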

However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it is possible to do this without a UDF, which may perform significantly better depending on the size of your dataset.

Given the following data:

+---+-------------------------+
|id |timestamp_value          |
+---+-------------------------+
|1  |2017-08-01T14:30:00+05:30|
|2  |2017-08-01T14:30:00+06:30|
|3  |2017-08-01T14:30:00+07:30|
+---+-------------------------+

created by:

l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])

where timestamp_value is a String, you could do the following (this uses to_timestamp and session-local timezone support, both introduced in Spark 2.2):

from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"), 
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))

which produces:

+------------------------+
|timestamp_value         |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+

Or, slightly differently:

op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"), 
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))

which produces:

+--------------------+
|timestamp_value     |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+

