get all the dates between two dates in Spark DataFrame
Question
I have a DF in which I have bookingDt and arrivalDt columns. I need to find all the dates between these two dates.
Sample code:
from pyspark.sql import Row
from pyspark.sql.functions import datediff

# assumes an active SparkSession named `spark`
df = spark.sparkContext.parallelize(
    [Row(vyge_id=1000, bookingDt='2018-01-01', arrivalDt='2018-01-05')]).toDF()
diffDaysDF = df.withColumn("diffDays", datediff('arrivalDt', 'bookingDt'))
diffDaysDF.show()
Output:
+----------+----------+-------+--------+
| arrivalDt| bookingDt|vyge_id|diffDays|
+----------+----------+-------+--------+
|2018-01-05|2018-01-01| 1000| 4|
+----------+----------+-------+--------+
What I tried was finding the number of days between the two dates and calculating all the dates using the timedelta function, then exploding the result (range(diffDays + 1) is needed to make the list inclusive of arrivalDt):

dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays + 1)]
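The list comprehension above can be checked in plain Python before wiring it into Spark. This standalone sketch hard-codes the sample row's values (bookingDt as a date object and diffDays = 4 are assumptions taken from the example data):

```python
from datetime import date, timedelta

# Values from the question's sample row
bookingDt = date(2018, 1, 1)
diffDays = 4  # datediff('2018-01-05', '2018-01-01')

# range(diffDays + 1) makes the list inclusive of arrivalDt
dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays + 1)]
print(dateList)
# ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05']
```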
Expected output:
Basically, I need to build a DF with a record for each date in between bookingDt and arrivalDt, inclusive.
+----------+----------+-------+----------+
| arrivalDt| bookingDt|vyge_id|     txnDt|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01|   1000|2018-01-01|
|2018-01-05|2018-01-01|   1000|2018-01-02|
|2018-01-05|2018-01-01|   1000|2018-01-03|
|2018-01-05|2018-01-01|   1000|2018-01-04|
|2018-01-05|2018-01-01|   1000|2018-01-05|
+----------+----------+-------+----------+
Answer
As long as you're using Spark version 2.1 or higher, you can exploit the fact that we can use column values as arguments when using pyspark.sql.functions.expr():
- Create a dummy string of repeating commas with a length equal to diffDays
- Split this string on ',' to turn it into an array of size diffDays + 1 (indices 0 through diffDays)
- Use pyspark.sql.functions.posexplode() to explode this array along with its indices
- Finally use pyspark.sql.functions.date_add() to add the index value number of days to bookingDt
Code:
import pyspark.sql.functions as f
diffDaysDF.withColumn("repeat", f.expr("split(repeat(',', diffDays), ',')"))\
.select("*", f.posexplode("repeat").alias("txnDt", "val"))\
.drop("repeat", "val", "diffDays")\
.withColumn("txnDt", f.expr("date_add(bookingDt, txnDt)"))\
.show()
#+----------+----------+-------+----------+
#| arrivalDt| bookingDt|vyge_id| txnDt|
#+----------+----------+-------+----------+
#|2018-01-05|2018-01-01| 1000|2018-01-01|
#|2018-01-05|2018-01-01| 1000|2018-01-02|
#|2018-01-05|2018-01-01| 1000|2018-01-03|
#|2018-01-05|2018-01-01| 1000|2018-01-04|
#|2018-01-05|2018-01-01| 1000|2018-01-05|
#+----------+----------+-------+----------+
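The repeat/split trick above can be sanity-checked outside Spark. This plain-Python sketch mimics what repeat, split, posexplode, and date_add do for the sample row; it illustrates the mechanism only and uses the example values (bookingDt, diffDays = 4) as assumptions:

```python
from datetime import date, timedelta

bookingDt = date(2018, 1, 1)
diffDays = 4

# repeat(',', diffDays) -> ",,,,"; split on ',' -> diffDays + 1 empty strings
arr = (',' * diffDays).split(',')
assert len(arr) == diffDays + 1

# posexplode yields (index, value) pairs; date_add(bookingDt, index) per index
txnDts = [str(bookingDt + timedelta(pos)) for pos, _ in enumerate(arr)]
print(txnDts)
# ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05']
```

This shows why the output is inclusive of arrivalDt: splitting diffDays commas yields diffDays + 1 array elements, so the explode produces offsets 0 through diffDays.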