pyspark sql 为展开的行添加不同的 Qtr start_date、End_date [英] pyspark sql Add different Qtr start_date, End_date for exploded rows

查看:33
本文介绍了pyspark sql 为展开的行添加不同的 Qtr start_date、End_date的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含 start_date、end_date、sales_target 的数据框.我添加了代码来识别日期范围之间的季度数,因此能够使用一些 UDF 将 sales_target 拆分为季度数.

I have a dataframe which has start_date, end_date, sales_target. I have added code to identify the number of quarters between the date range, and accordingly able to split the sales_target across the number of quarters, using some a UDF.

df = sqlContext.createDataFrame([("2020-01-01","2020-12-31","15"),("2020-04-01","2020-12-31","11"),("2020-07-01","2020-12-31","3")], ["start_date","end_date","sales_target"])

+----------+----------+------------+
|start_date| end_date |sales_target|
+----------+----------+------------+
|2020-01-01|2020-12-31|          15|
|2020-04-01|2020-12-31|          11|
|2020-07-01|2020-12-31|           3|
+----------+----------+------------+

以下是计算季度数并使用 UDF 函数拆分 sales_target 后的数据框.

Following is the dataframe after calculating number of quarters, and splitting the sales_target using an UDF function.

spark.sql('select *, round(months_between(end_date, start_date)/3) as noq from df_temp').createOrReplaceTempView("df_temp")
spark.sql("select *, st_udf(cast(sales_target as integer), cast(noq as integer)) as sales_target from df_temp").createOrReplaceTempView("df_temp")
+----------+----------+--------+---------------+ 
|start_date| end_date |num_qtrs|sales_target_n | 
+----------+----------+--------+---------------+ 
|2020-01-01|2020-12-31|       4| [4,4,4,3]     | 
|2020-04-01|2020-12-31|       3| [4,4,3]       | 
|2020-07-01|2020-12-31|       2| [2,1]         | 
+----------+----------+--------+---------------+

分解 sales_target 后,我​​能够得到以下结果:

After exploding the sales_target, I am able to get the following results:

+----------+----------+--------+-------------+---------------+------------------+
|start_date| end_date |num_qtrs|sales_target |sales_target_n | sales_target_new | 
+----------+----------+--------+-------------+---------------+------------------+  
|2020-01-01|2020-12-31|       4|        15   |  [4,4,4,3]    | 4                |
|2020-01-01|2020-12-31|       4|        15   |  [4,4,4,3]    | 4                |
|2020-01-01|2020-12-31|       4|        15   |  [4,4,4,3]    | 4                |
|2020-01-01|2020-12-31|       4|        15   |  [4,4,4,3]    | 3                |
|2020-04-01|2020-12-31|       3|        11   |  [4,4,3]      | 4                | 
|2020-04-01|2020-12-31|       3|        11   |  [4,4,3]      | 4                |
|2020-04-01|2020-12-31|       3|        11   |  [4,4,3]      | 3                |
|2020-07-01|2020-12-31|       2|         3   |  [2,1]        | 2                |
|2020-07-01|2020-12-31|       2|         3   |  [2,1]        | 1                |
+----------+----------+--------+-------------+---------------+------------------+

我需要帮助根据 num_qtrs 值为每一行添加不同的开始/结束日期.我需要获取如下数据框.

I need help in adding different start/end dates for each row depending upon the num_qtrs value. I need to get a dataframe as below.

+----------+----------+--------+-------------+------------------+--------------+--------------+
|start_date| end_date |num_qtrs|sales_target | sales_target_new |new_start_date| new_end_date | 
+----------+----------+--------+-------------+------------------+--------------+--------------+  
|2020-01-01|2020-12-31|       4| [4,4,4,3]   | 4                |2020-01-01    |2020-03-31    |
|2020-01-01|2020-12-31|       4| [4,4,4,3]   | 4                |2020-04-01    |2020-06-30    |
|2020-01-01|2020-12-31|       4| [4,4,4,3]   | 4                |2020-07-01    |2020-09-30    |
|2020-01-01|2020-12-31|       4| [4,4,4,3]   | 3                |2020-10-01    |2020-12-31    |
|2020-04-01|2020-12-31|       3| [4,4,3]     | 4                |2020-04-01    |2020-06-30    | 
|2020-04-01|2020-12-31|       3| [4,4,3]     | 4                |2020-07-01    |2020-09-30    |
|2020-04-01|2020-12-31|       3| [4,4,3]     | 3                |2020-10-01    |2020-12-31    |
|2020-07-01|2020-12-31|       2| [2,1]       | 2                |2020-07-01    |2020-09-30    |
|2020-07-01|2020-12-31|       2| [2,1]       | 1                |2020-10-01    |2020-12-31    |
+----------+----------+--------+-------------+------------------+--------------+--------------+ 

有人可以帮助我使用 pyspark 代码示例来实现上述期望的结果.

Can someone please help me with pyspark code sample to achieve the above desired results.

序列错误的更新:谢谢

推荐答案

在应用 UDF 后考虑将下面作为输入数据框.

Considering below to be your input dataframe after applying your UDF.

输入:

+----------+----------+--------+--------------+
|start_date|  end_date|num_qtrs|sales_target_n|
+----------+----------+--------+--------------+
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|
|2020-07-01|2020-12-31|       2|        [2, 1]|
+----------+----------+--------+--------------+

您可以组合使用row_numberadd_monthsdate_add 来获得您想要的输出,如下所示,

You can use a combination of row_number, add_months and date_add to obtain your desired output, as shown below,

from pyspark.sql.functions import explode, row_number, expr
from pyspark.sql import Window

window = Window.partitionBy('start_date').orderBy(desc("sales_target_new"))

df.withColumn('sales_target_new', explode('sales_target_n')).\
withColumn('row_num', row_number().over(window)).\
withColumn('new_start_date', expr("add_months(start_date, (row_num-1) * 3)")).\
withColumn('new_end_date', expr("add_months(date_add(start_date, -1), row_num * 3)")).\
orderBy('start_date', 'row_num').show()

输出:

+----------+----------+--------+--------------+----------------+-------+--------------+------------+
|start_date|  end_date|num_qtrs|sales_target_n|sales_target_new|row_num|new_start_date|new_end_date|
+----------+----------+--------+--------------+----------------+-------+--------------+------------+
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               4|      1|    2020-01-01|  2020-03-31|
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               4|      2|    2020-04-01|  2020-06-30|
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               4|      3|    2020-07-01|  2020-09-30|
|2020-01-01|2020-12-31|       4|  [4, 4, 4, 3]|               3|      4|    2020-10-01|  2020-12-31|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|               4|      1|    2020-04-01|  2020-06-30|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|               4|      2|    2020-07-01|  2020-09-30|
|2020-04-01|2020-12-31|       3|     [4, 4, 3]|               3|      3|    2020-10-01|  2020-12-31|
|2020-07-01|2020-12-31|       2|        [2, 1]|               2|      1|    2020-07-01|  2020-09-30|
|2020-07-01|2020-12-31|       2|        [2, 1]|               1|      2|    2020-10-01|  2020-12-31|
+----------+----------+--------+--------------+----------------+-------+--------------+------------+

您可以根据自己的要求修改window.

You can modify the window based on your requirements.

这篇关于pyspark sql 为展开的行添加不同的 Qtr start_date、End_date的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆