Change the timestamp to UTC format in Pyspark
Question
I have an input dataframe (ip_df), and the data in it looks like this:
id timestamp_value
1 2017-08-01T14:30:00+05:30
2 2017-08-01T14:30:00+06:30
3 2017-08-01T14:30:00+07:30
I need to create a new dataframe (op_df) in which the timestamp values are converted to UTC. The final output dataframe should look like this:
id timestamp_value
1 2017-08-01T09:00:00+00:00
2 2017-08-01T08:00:00+00:00
3 2017-08-01T07:00:00+00:00
I want to achieve this using PySpark. Can someone please help me with it? Any help will be appreciated.
Answer
If you absolutely need the timestamp to be formatted exactly as shown, namely with the timezone represented as "+00:00", I think using a UDF, as already suggested, is your best option.
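As a rough sketch of that UDF approach (the function name to_utc_iso and the wrapping shown in the comments are my own illustration, not from the original answer; it assumes Python 3.7+, where strptime's %z accepts a colon in the offset):

```python
from datetime import datetime, timezone

def to_utc_iso(ts: str) -> str:
    """Parse an ISO-8601 timestamp with a UTC offset and render it
    back in UTC with the timezone written exactly as '+00:00'."""
    dt = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z')
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S+00:00')

# To use it on the dataframe, it would be wrapped as a PySpark UDF, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   to_utc_iso_udf = udf(to_utc_iso, StringType())
#   op_df = ip_df.select('id', to_utc_iso_udf('timestamp_value').alias('timestamp_value'))

print(to_utc_iso('2017-08-01T14:30:00+05:30'))  # 2017-08-01T09:00:00+00:00
```

Note that because the conversion runs row by row in Python, this will generally be slower than the built-in-function approach below.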
However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it is possible to do this without a UDF, which may perform significantly better depending on the size of your dataset.
Given the following data:
+---+-------------------------+
|id |timestamp_value |
+---+-------------------------+
|1 |2017-08-01T14:30:00+05:30|
|2 |2017-08-01T14:30:00+06:30|
|3 |2017-08-01T14:30:00+07:30|
+---+-------------------------+
created by:
l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
where timestamp_value is a String, you could do the following (this uses to_timestamp and session-local timezone support, which were introduced in Spark 2.2):
from pyspark.sql.functions import to_timestamp, date_format

spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))
which yields:
+------------------------+
|timestamp_value |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+
Or, with a slightly different output format:
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))
which yields:
+--------------------+
|timestamp_value |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+