pyspark sql Add different Qtr start_date, End_date for exploded rows
Question
I have a dataframe with start_date, end_date, and sales_target columns. I have added code to identify the number of quarters between the date range, and I am able to split sales_target across that number of quarters using a UDF.
df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])
df.createOrReplaceTempView("df_temp")
+----------+----------+------------+
|start_date| end_date |sales_target|
+----------+----------+------------+
|2020-01-01|2020-12-31| 15|
|2020-04-01|2020-12-31| 11|
|2020-07-01|2020-12-31| 3|
+----------+----------+------------+
Following is the dataframe after calculating the number of quarters and splitting sales_target using a UDF:
spark.sql('select *, round(months_between(end_date, start_date)/3) as num_qtrs from df_temp').createOrReplaceTempView("df_temp")
spark.sql("select *, st_udf(cast(sales_target as integer), cast(num_qtrs as integer)) as sales_target_n from df_temp").createOrReplaceTempView("df_temp")
+----------+----------+--------+---------------+
|start_date| end_date |num_qtrs|sales_target_n |
+----------+----------+--------+---------------+
|2020-01-01|2020-12-31| 4| [4,4,4,3] |
|2020-04-01|2020-12-31| 3| [4,4,3] |
|2020-07-01|2020-12-31| 2| [2,1] |
+----------+----------+--------+---------------+
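The st_udf used above is not shown in the question. Judging from the output table, it presumably distributes an integer target across the quarters as evenly as possible, with larger shares first; a minimal plain-Python sketch of that assumed logic (the function name and behavior are inferred, not taken from the question) could look like this:

```python
def split_target(sales_target, num_qtrs):
    """Split an integer target across num_qtrs quarters as evenly as
    possible, front-loading the remainder (e.g. 15 over 4 -> [4, 4, 4, 3])."""
    base, rem = divmod(sales_target, num_qtrs)
    # The first `rem` quarters each take one extra unit
    return [base + 1 if i < rem else base for i in range(num_qtrs)]
```

Wrapped with pyspark's udf() and an ArrayType(IntegerType()) return type, this reproduces the sales_target_n arrays shown in the table.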
After exploding sales_target_n, I am able to get the following results:
+----------+----------+--------+-------------+---------------+------------------+
|start_date| end_date |num_qtrs|sales_target |sales_target_n | sales_target_new |
+----------+----------+--------+-------------+---------------+------------------+
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 4 |
|2020-01-01|2020-12-31| 4| 15 | [4,4,4,3] | 3 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 4 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 4 |
|2020-04-01|2020-12-31| 3| 11 | [4,4,3] | 3 |
|2020-07-01|2020-12-31| 2| 3 | [2,1] | 2 |
|2020-07-01|2020-12-31| 2| 3 | [2,1] | 1 |
+----------+----------+--------+-------------+---------------+------------------+
I need help adding different start/end dates to each row depending on the num_qtrs value. I need to get a dataframe as below:
+----------+----------+--------+--------------+----------------+--------------+------------+
|start_date|  end_date|num_qtrs|sales_target_n|sales_target_new|new_start_date|new_end_date|
+----------+----------+--------+--------------+----------------+--------------+------------+
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               4|    2020-01-01|  2020-03-31|
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               4|    2020-04-01|  2020-06-30|
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               4|    2020-07-01|  2020-09-30|
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               3|    2020-10-01|  2020-12-31|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|               4|    2020-04-01|  2020-06-30|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|               4|    2020-07-01|  2020-09-30|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|               3|    2020-10-01|  2020-12-31|
|2020-07-01|2020-12-31|       2|        [2, 1]|               2|    2020-07-01|  2020-09-30|
|2020-07-01|2020-12-31|       2|        [2, 1]|               1|    2020-10-01|  2020-12-31|
+----------+----------+--------+--------------+----------------+--------------+------------+
Can someone please help me with a pyspark code sample to achieve the above desired results?
Update for the sequence error: Thanks.
Answer
Considering the below to be your input dataframe after applying your UDF.
Input:
+----------+----------+--------+--------------+
|start_date| end_date|num_qtrs|sales_target_n|
+----------+----------+--------+--------------+
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]|
|2020-04-01|2020-12-31| 3| [4, 4, 3]|
|2020-07-01|2020-12-31| 2| [2, 1]|
+----------+----------+--------+--------------+
You can use a combination of row_number, add_months and date_add to obtain your desired output, as shown below:
from pyspark.sql.functions import explode, row_number, expr, desc
from pyspark.sql import Window

# Assign a quarter index (row_num) within each start_date group
window = Window.partitionBy('start_date').orderBy(desc("sales_target_new"))

df.withColumn('sales_target_new', explode('sales_target_n')) \
  .withColumn('row_num', row_number().over(window)) \
  .withColumn('new_start_date', expr("add_months(start_date, (row_num-1) * 3)")) \
  .withColumn('new_end_date', expr("add_months(date_add(start_date, -1), row_num * 3)")) \
  .orderBy('start_date', 'row_num').show()
Output:
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
|start_date| end_date|num_qtrs|sales_target_n|sales_target_new|row_num|new_start_date|new_end_date|
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 4| 1| 2020-01-01| 2020-03-31|
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 4| 2| 2020-04-01| 2020-06-30|
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 4| 3| 2020-07-01| 2020-09-30|
|2020-01-01|2020-12-31| 4| [4, 4, 4, 3]| 3| 4| 2020-10-01| 2020-12-31|
|2020-04-01|2020-12-31| 3| [4, 4, 3]| 4| 1| 2020-04-01| 2020-06-30|
|2020-04-01|2020-12-31| 3| [4, 4, 3]| 4| 2| 2020-07-01| 2020-09-30|
|2020-04-01|2020-12-31| 3| [4, 4, 3]| 3| 3| 2020-10-01| 2020-12-31|
|2020-07-01|2020-12-31| 2| [2, 1]| 2| 1| 2020-07-01| 2020-09-30|
|2020-07-01|2020-12-31| 2| [2, 1]| 1| 2| 2020-10-01| 2020-12-31|
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
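The date arithmetic in the two expr() calls can be checked independently of Spark: add_months(start_date, (row_num-1)*3) advances the quarter start, and add_months(date_add(start_date, -1), row_num*3) lands on the day before the next quarter begins. A plain-Python sketch, with a small helper that mimics Spark's add_months month-end clamping (the helper itself is not part of the answer's code):

```python
from datetime import date, timedelta
import calendar

def add_months(d, n):
    """Shift date d by n months, clamping the day to the end of the
    target month (mirrors Spark SQL's add_months behavior)."""
    month_index = d.month - 1 + n
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

start = date(2020, 1, 1)
for row_num in range(1, 5):
    new_start = add_months(start, (row_num - 1) * 3)
    new_end = add_months(start - timedelta(days=1), row_num * 3)
    print(new_start, new_end)  # e.g. 2020-01-01 2020-03-31 for row_num=1
```

The clamping step is what makes the quarter ends land on the 30th or 31st correctly: shifting 2019-12-31 by 6 months yields 2020-06-30, not an invalid June 31st.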
You can modify the window based on your requirements.