get all the dates between two dates in Spark DataFrame
Question
I have a DF in which I have bookingDt and arrivalDt columns. I need to find all the dates between these two dates.
Sample code:
from pyspark.sql import Row
from pyspark.sql.functions import datediff

# assumes an active SparkSession named `spark`
df = spark.sparkContext.parallelize(
    [Row(vyge_id=1000, bookingDt='2018-01-01', arrivalDt='2018-01-05')]).toDF()
diffDaysDF = df.withColumn("diffDays", datediff('arrivalDt', 'bookingDt'))
diffDaysDF.show()
Output:
+----------+----------+-------+--------+
| arrivalDt| bookingDt|vyge_id|diffDays|
+----------+----------+-------+--------+
|2018-01-05|2018-01-01| 1000| 4|
+----------+----------+-------+--------+
What I tried was finding the number of days between the two dates and calculating all the dates using the timedelta function, then exploding the result (range(diffDays + 1) is needed to make the list inclusive of arrivalDt):

dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays + 1)]
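The list comprehension above can be checked in plain Python before wiring it into Spark. This standalone sketch hard-codes the sample row's values (bookingDt as a date object and diffDays = 4 are assumptions taken from the example data):

```python
from datetime import date, timedelta

# Values from the question's sample row
bookingDt = date(2018, 1, 1)
diffDays = 4  # datediff('2018-01-05', '2018-01-01')

# range(diffDays + 1) makes the list inclusive of arrivalDt
dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays + 1)]
print(dateList)
# ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05']
```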
Expected output:
Basically, I need to build a DF with a record for each date in between bookingDt and arrivalDt, inclusive.
+----------+----------+-------+----------+
| arrivalDt| bookingDt|vyge_id|     txnDt|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01|   1000|2018-01-01|
|2018-01-05|2018-01-01|   1000|2018-01-02|
|2018-01-05|2018-01-01|   1000|2018-01-03|
|2018-01-05|2018-01-01|   1000|2018-01-04|
|2018-01-05|2018-01-01|   1000|2018-01-05|
+----------+----------+-------+----------+
Answer
As long as you're using Spark version 2.1 or higher, you can exploit the fact that we can use column values as arguments when using pyspark.sql.functions.expr():
- Create a dummy string of repeating commas with a length equal to diffDays
- Split this string on ',' to turn it into an array of size diffDays + 1 (indices 0 through diffDays)
- Use pyspark.sql.functions.posexplode() to explode this array along with its indices
- Finally use pyspark.sql.functions.date_add() to add the index value number of days to bookingDt
Code:
import pyspark.sql.functions as f
diffDaysDF.withColumn("repeat", f.expr("split(repeat(',', diffDays), ',')"))\
.select("*", f.posexplode("repeat").alias("txnDt", "val"))\
.drop("repeat", "val", "diffDays")\
.withColumn("txnDt", f.expr("date_add(bookingDt, txnDt)"))\
.show()
#+----------+----------+-------+----------+
#| arrivalDt| bookingDt|vyge_id| txnDt|
#+----------+----------+-------+----------+
#|2018-01-05|2018-01-01| 1000|2018-01-01|
#|2018-01-05|2018-01-01| 1000|2018-01-02|
#|2018-01-05|2018-01-01| 1000|2018-01-03|
#|2018-01-05|2018-01-01| 1000|2018-01-04|
#|2018-01-05|2018-01-01| 1000|2018-01-05|
#+----------+----------+-------+----------+
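The repeat/split trick above can be sanity-checked outside Spark. This plain-Python sketch mimics what repeat, split, posexplode, and date_add do for the sample row; it illustrates the mechanism only and uses the example values (bookingDt, diffDays = 4) as assumptions:

```python
from datetime import date, timedelta

bookingDt = date(2018, 1, 1)
diffDays = 4

# repeat(',', diffDays) -> ",,,,"; split on ',' -> diffDays + 1 empty strings
arr = (',' * diffDays).split(',')
assert len(arr) == diffDays + 1

# posexplode yields (index, value) pairs; date_add(bookingDt, index) per index
txnDts = [str(bookingDt + timedelta(pos)) for pos, _ in enumerate(arr)]
print(txnDts)
# ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05']
```

This shows why the output is inclusive of arrivalDt: splitting diffDays commas yields diffDays + 1 array elements, so the explode produces offsets 0 through diffDays.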