Get all the dates between two dates in Spark DataFrame


Question

I have a DF in which I have bookingDt and arrivalDt columns. I need to find all the dates between these two dates.

Sample code:

from pyspark.sql import Row
from pyspark.sql.functions import datediff

df = spark.sparkContext.parallelize(
    [Row(vyge_id=1000, bookingDt='2018-01-01', arrivalDt='2018-01-05')]).toDF()
diffDaysDF = df.withColumn("diffDays", datediff('arrivalDt', 'bookingDt'))
diffDaysDF.show()

Output:

+----------+----------+-------+--------+
| arrivalDt| bookingDt|vyge_id|diffDays|
+----------+----------+-------+--------+
|2018-01-05|2018-01-01|   1000|       4|
+----------+----------+-------+--------+

What I tried was finding the number of days between the two dates, calculating all the dates using the timedelta function, and exploding the result.

dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays)]
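
One way to complete this attempt (a sketch of my own, not from the original post; the date_range helper name is hypothetical): wrap the timedelta logic in a UDF that returns the full list of dates, then explode it. Note that range(diffDays) would stop one day short of arrivalDt, so the sketch uses diffDays + 1 to keep the range inclusive.

from datetime import datetime, timedelta

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

# Hypothetical helper: build every date from bookingDt through arrivalDt.
@f.udf(returnType=ArrayType(StringType()))
def date_range(bookingDt, diffDays):
    start = datetime.strptime(bookingDt, "%Y-%m-%d").date()
    # diffDays + 1 so that arrivalDt itself is included
    return [str(start + timedelta(days=i)) for i in range(diffDays + 1)]

diffDaysDF.withColumn("txnDt", f.explode(date_range("bookingDt", "diffDays"))).show()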

Expected output:

Basically, I need to build a DF with a record for each date in between bookingDt and arrivalDt, inclusive.

+----------+----------+-------+----------+
| arrivalDt| bookingDt|vyge_id|     txnDt|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01|   1000|2018-01-01|
|2018-01-05|2018-01-01|   1000|2018-01-02|
|2018-01-05|2018-01-01|   1000|2018-01-03|
|2018-01-05|2018-01-01|   1000|2018-01-04|
|2018-01-05|2018-01-01|   1000|2018-01-05|
+----------+----------+-------+----------+

Answer

As long as you're using Spark version 2.1 or higher, you can exploit the fact that column values can be used as arguments when using pyspark.sql.functions.expr():

  • Create a dummy string of repeating commas with a length equal to diffDays
  • Split this string on ',' to turn it into an array of size diffDays + 1
  • Use pyspark.sql.functions.posexplode() to explode this array along with its indices
  • Finally, use pyspark.sql.functions.date_add() to add the index number of days to bookingDt

Code:

import pyspark.sql.functions as f

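# repeat(',', diffDays) builds a string of diffDays commas; splitting on ','
# yields an array of diffDays + 1 empty strings, one slot per date (inclusive).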
diffDaysDF.withColumn("repeat", f.expr("split(repeat(',', diffDays), ',')"))\
    .select("*", f.posexplode("repeat").alias("txnDt", "val"))\
    .drop("repeat", "val", "diffDays")\
    .withColumn("txnDt", f.expr("date_add(bookingDt, txnDt)"))\
    .show()
#+----------+----------+-------+----------+
#| arrivalDt| bookingDt|vyge_id|     txnDt|
#+----------+----------+-------+----------+
#|2018-01-05|2018-01-01|   1000|2018-01-01|
#|2018-01-05|2018-01-01|   1000|2018-01-02|
#|2018-01-05|2018-01-01|   1000|2018-01-03|
#|2018-01-05|2018-01-01|   1000|2018-01-04|
#|2018-01-05|2018-01-01|   1000|2018-01-05|
#+----------+----------+-------+----------+
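
If you're on Spark 2.4 or later, the built-in sequence function offers a more direct route; a minimal sketch under that assumption (not part of the original answer):

import pyspark.sql.functions as f

# sequence() generates one date per day from bookingDt through arrivalDt, inclusive.
df.withColumn(
    "txnDt",
    f.explode(f.expr("sequence(to_date(bookingDt), to_date(arrivalDt), interval 1 day)"))
).show()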

