How to explode an array without duplicate records

Question

I have the following DataFrame, which has an array column (cf_values).

+---------------+------------+----------+----------+---+---------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+---------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-10-01|2020-12-31|
+---------------+------------+----------+----------+---+---------+----------+----------+

I need a new column that holds one element of cf_values per row, added to the existing records with withColumn. If I use explode, I get duplicate records and end up with 16 rows:

+---------------+------------+----------+----------+---+---------+------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|cf_new|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+---------+------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-10-01|2020-12-30|
+---------------+------------+----------+----------+---+---------+------+----------+----------+
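
To see why the row count jumps to 16, here is a minimal plain-Python sketch of what an explode does. It is only a model of the behavior, not Spark code: the helper name explode_rows and the trimmed-down rows are made up for illustration. Every input row is repeated once per array element, so 4 rows with 4 elements each give 16 rows.

```python
# Trimmed-down version of the question's data: 4 rows, each carrying the full array.
rows = [
    {"new_sdt": "2020-01-01", "cf_values": [4, 4, 4, 3]},
    {"new_sdt": "2020-04-01", "cf_values": [4, 4, 4, 3]},
    {"new_sdt": "2020-07-01", "cf_values": [4, 4, 4, 3]},
    {"new_sdt": "2020-10-01", "cf_values": [4, 4, 4, 3]},
]


def explode_rows(rows, array_key, out_key):
    """Hypothetical helper: emit one output row per (input row, array element) pair."""
    return [
        {**row, out_key: value}
        for row in rows
        for value in row[array_key]
    ]


exploded = explode_rows(rows, "cf_values", "cf_new")
print(len(exploded))  # 4 rows x 4 elements each
```

The duplication is therefore not a bug in explode; it is the expected result when every row holds the whole array rather than just its own element.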

Expected result: 4 records, each with a different element of cf_values as cf_new, along with its own new_sdt and new_edate.

+---------------+------------+----------+----------+---+------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_new|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |3     |2020-10-01|2020-12-31|
+---------------+------------+----------+----------+---+------+----------+----------+

Recommended answer

Instead of exploding the array, you can pick each value from the array based on its position.

This position can be generated dynamically with row_number, as shown below.

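from pyspark.sql import Window
from pyspark.sql.functions import expr, row_number

# Number each customer's rows in order of new_sdt (1, 2, 3, ...).
window = Window.partitionBy('customer_number').orderBy('new_sdt')

(
    df.withColumn('row_num', row_number().over(window))
    # Array indexing is 0-based, so row 1 picks cf_values[0], row 2 picks cf_values[1], and so on.
    .withColumn('cf_new', expr("cf_values[row_num - 1]"))
    .drop('row_num')
    .show()
)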

Output:

+---------------+------------+----------+----------+---+------------+----------+----------+------+
|customer_number|sales_target|start_date|  end_date|noq|   cf_values|   new_sdt| new_edate|cf_new|
+---------------+------------+----------+----------+---+------------+----------+----------+------+
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-01-01|2020-03-31|     4|
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-04-01|2020-06-30|     4|
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-07-01|2020-09-30|     4|
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-10-01|2020-12-31|     3|
+---------------+------------+----------+----------+---+------------+----------+----------+------+
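
The same idea can be sketched in plain Python to make the logic explicit. This is a model of the approach under stated assumptions, not Spark code: the helper pick_by_position is hypothetical, the rows are trimmed to the relevant columns, and returning None for an out-of-range position is a choice made for this sketch only. Rows are sorted within each customer by new_sdt, numbered from 1, and each row takes the array element at index row number minus 1.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question, deliberately out of order.
rows = [
    {"customer_number": "A011021", "new_sdt": "2020-07-01", "cf_values": [4, 4, 4, 3]},
    {"customer_number": "A011021", "new_sdt": "2020-01-01", "cf_values": [4, 4, 4, 3]},
    {"customer_number": "A011021", "new_sdt": "2020-10-01", "cf_values": [4, 4, 4, 3]},
    {"customer_number": "A011021", "new_sdt": "2020-04-01", "cf_values": [4, 4, 4, 3]},
]


def pick_by_position(rows, partition_key, order_key, array_key, out_key):
    """Hypothetical helper: number rows per partition and pick the matching array element."""
    # Sort by partition first so groupby sees each partition as one contiguous run.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        # row_num starts at 1, mirroring a row-number window function.
        for row_num, row in enumerate(group, start=1):
            values = row[array_key]
            index = row_num - 1
            # Assumption for this sketch: positions past the end of the array give None.
            picked = values[index] if index < len(values) else None
            result.append({**row, out_key: picked})
    return result


picked = pick_by_position(rows, "customer_number", "new_sdt", "cf_values", "cf_new")
for row in picked:
    print(row["new_sdt"], row["cf_new"])
```

Note that the result depends entirely on the ordering column: the n-th row by new_sdt receives the n-th array element, so this only gives the intended pairing when the array is stored in the same order as the date ranges.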
