Generating monthly timestamps between two dates in a PySpark DataFrame

Question

I have a DataFrame with a "date" column, and I'm trying to generate a new DataFrame containing all monthly timestamps between the min and max date of that column.

One solution is the following:

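from pyspark.sql import functions as f
from pyspark.sql.functions import asc, col, format_string, max as max_, min as min_

# Approximate a month as 31 days, expressed in seconds
month_step = 31 * 60 * 60 * 24

# Min and max of the "date" column as epoch seconds
min_date, max_date = df.select(min_("date").cast("long"), max_("date").cast("long")).first()

# Integer division keeps the bounds as whole multiples of month_step (spark.range needs ints)
df_ts = spark.range(
    (min_date // month_step) * month_step,
    ((max_date // month_step) + 1) * month_step,
    month_step
).select(col("id").cast("timestamp").alias("yearmonth"))

# Format each timestamp as a "YYYY-MM" string
df_formatted_ts = df_ts.withColumn(
    "yearmonth",
    f.concat(f.year("yearmonth"), f.lit('-'), format_string("%02d", f.month("yearmonth")))
).select('yearmonth')

df_formatted_ts.orderBy(asc('yearmonth')).show(150, False)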

The problem is that I used 31 days as the month_step, which isn't really correct because some months have 30 days and February has only 28 or 29. Is it possible to make this more precise?

Just as a note: later I only need the year and month values, so I will ignore the day and time. Still, because I'm generating timestamps over quite a large date range (between 2001 and 2018), the timestamps drift.

That's why some months are occasionally skipped. For example, this snapshot is missing 2010-02:

|2010-01  |
|2010-03  |
|2010-04  |
|2010-05  |
|2010-06  |
|2010-07  |

I checked, and only 3 months were skipped from 2001 through 2018.
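To make the drift concrete, here is a minimal plain-Python sketch (no Spark involved) of what a fixed 31-day step does to the month labels. The function name `months_hit_by_fixed_step` and the start date 2010-01-30 are assumptions chosen for illustration; they are not taken from the question's data.

```python
from datetime import date, timedelta

def months_hit_by_fixed_step(start, end, step_days=31):
    """Return the distinct 'YYYY-MM' labels visited when stepping by a fixed number of days."""
    labels = []
    current = start
    while current <= end:
        label = f"{current.year}-{current.month:02d}"
        if label not in labels:
            labels.append(label)
        # A fixed step ignores the real length of each month
        current += timedelta(days=step_days)
    return labels

# Hypothetical start late in January: the next step lands on 2010-03-02
visited = months_hit_by_fixed_step(date(2010, 1, 30), date(2010, 7, 31))
print(visited)
# ['2010-01', '2010-03', '2010-04', '2010-05', '2010-06', '2010-07']
```

Only months shorter than 31 days can be jumped over this way, which is consistent with just a handful of months going missing across a long range.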

Recommended answer

Suppose you had the following DataFrame:

data = [("2000-01-01","2002-12-01")]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
#+----------+----------+
#|   minDate|   maxDate|
#+----------+----------+
#|2000-01-01|2002-12-01|
#+----------+----------+

You can add a column date with all of the months in between minDate and maxDate by following the same approach as my answer to this question (https://stackoverflow.com/a/51749877/5858851).

Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:

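import pyspark.sql.functions as f

(
    df
    # Whole months between the two dates; cast to int so repeat() receives an integer count
    .withColumn("monthsDiff", f.months_between("maxDate", "minDate").cast("int"))
    # A string of monthsDiff commas splits into monthsDiff + 1 empty strings
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))
    # posexplode yields one row per element; its position (0..monthsDiff) is the month offset
    .select("*", f.posexplode("repeat").alias("date", "val"))
    .withColumn("date", f.expr("add_months(minDate, date)"))
    .select('date')
    .show(n=50)
)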
#+----------+
#|      date|
#+----------+
#|2000-01-01|
#|2000-02-01|
#|2000-03-01|
#|2000-04-01|
# ...skipping some rows...
#|2002-10-01|
#|2002-11-01|
#|2002-12-01|
#+----------+
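
The same idea can be sketched in plain Python, which may help when checking the expected row count. This is only an illustration of the logic, not Spark code: the helper name `month_starts_between` is hypothetical, and it assumes both dates fall on the first of a month, as in the example above.

```python
from datetime import date

def month_starts_between(min_date, max_date):
    """List the first day of every month from min_date's month through max_date's month."""
    # Whole calendar months between the two dates (the role months_between plays above)
    months_diff = (max_date.year - min_date.year) * 12 + (max_date.month - min_date.month)
    result = []
    # Offsets 0..months_diff correspond to the positions produced by posexplode
    for offset in range(months_diff + 1):
        # Add `offset` months to the start month, carrying into the year (the role of add_months)
        total = min_date.year * 12 + (min_date.month - 1) + offset
        result.append(date(total // 12, total % 12 + 1, 1))
    return result

months = month_starts_between(date(2000, 1, 1), date(2002, 12, 1))
print(len(months), months[0], months[-1])
# 36 2000-01-01 2002-12-01
```

Because the step is a calendar month rather than a fixed number of seconds, no month can be skipped regardless of how long the range is.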
