Generating monthly timestamps between two dates in pyspark dataframe
Question
I have some DataFrame with a "date" column, and I'm trying to generate a new DataFrame with all monthly timestamps between the min and max date from the "date" column.
One solution is the following:
import pyspark.sql.functions as f
from pyspark.sql.functions import min as min_, max as max_, col, format_string, asc

month_step = 31 * 60 * 60 * 24  # 31 days, in seconds

min_date, max_date = df.select(
    min_("date").cast("long"), max_("date").cast("long")
).first()

df_ts = spark.range(
    (min_date // month_step) * month_step,
    ((max_date // month_step) + 1) * month_step,
    month_step
).select(col("id").cast("timestamp").alias("yearmonth"))

df_formatted_ts = df_ts.withColumn(
    "yearmonth",
    f.concat(f.year("yearmonth"), f.lit('-'), format_string("%02d", f.month("yearmonth")))
).select("yearmonth")

df_formatted_ts.orderBy(asc("yearmonth")).show(150, False)
The problem is that I took 31 days as the month_step, which is not really correct, because some months have 30 days or even 28. Is it possible to somehow make this more precise?
Just as a note: later I only need the year and month values, so I will ignore day and time. But because I'm generating timestamps over quite a big date range (between 2001 and 2018), the timestamps drift.
That's why some months are sometimes skipped. For example, this snapshot is missing 2010-02:
|2010-01 |
|2010-03 |
|2010-04 |
|2010-05 |
|2010-06 |
|2010-07 |
I checked, and just 3 months were skipped from 2001 through 2018.
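The skipping can be shown without Spark at all: a minimal pure-Python sketch (standard library only) stepping 31 days at a time over 2001–2018 produces at most 213 timestamps, which cannot cover all 216 calendar months in that range, so some YYYY-MM values must be missing.

```python
from datetime import datetime, timedelta

start = datetime(2001, 1, 1)
end = datetime(2018, 12, 31)

# Collect the distinct YYYY-MM values hit by a fixed 31-day step.
months = set()
t = start
while t <= end:
    months.add(t.strftime("%Y-%m"))
    t += timedelta(days=31)

# 216 calendar months in the range, but fewer distinct YYYY-MM values hit.
total_months = (end.year - start.year) * 12 + (end.month - start.month) + 1
print(total_months, len(months))
```

Which months get skipped depends on how the fixed step aligns with month boundaries, which is why the gaps look arbitrary (e.g. 2010-02 above).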
Answer
Suppose you had the following DataFrame:
data = [("2000-01-01","2002-12-01")]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
#+----------+----------+
#| minDate| maxDate|
#+----------+----------+
#|2000-01-01|2002-12-01|
#+----------+----------+
You can follow the same approach used for generating a daily date range. Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:
import pyspark.sql.functions as f

# months_between gives the month count; repeat/split builds an array of that
# length, posexplode yields one index per month, and add_months shifts
# minDate by that index.
df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
#+----------+
#| date|
#+----------+
#|2000-01-01|
#|2000-02-01|
#|2000-03-01|
#|2000-04-01|
# ...skipping some rows...
#|2002-10-01|
#|2002-11-01|
#|2002-12-01|
#+----------+
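The same calendar arithmetic can be sketched in pure Python to see why this approach never skips a month: counting whole months and adding them one at a time stays aligned with month boundaries, unlike a fixed 31-day step. The helper names `add_months` and `month_range` below are illustrative, not part of any library; only the standard library is assumed.

```python
from datetime import date

def add_months(d, n):
    # Shift d forward by n calendar months (mirrors Spark's add_months
    # for first-of-month dates).
    y, m = divmod(d.month - 1 + n, 12)
    return date(d.year + y, m + 1, d.day)

def month_range(min_date, max_date):
    # Whole months between the endpoints, then one date per month.
    n = (max_date.year - min_date.year) * 12 + (max_date.month - min_date.month)
    return [add_months(min_date, i) for i in range(n + 1)]

dates = month_range(date(2000, 1, 1), date(2002, 12, 1))
print(len(dates))   # 36 months, none skipped
```

As a usage note: on Spark 2.4+, the built-in `sequence` function can replace the repeat/posexplode trick in one expression, e.g. `f.expr("sequence(to_date(minDate), to_date(maxDate), interval 1 month)")` followed by `f.explode`.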