Concatenate array pyspark
Problem description
I have a pyspark DataFrame (Spark version < 2.4).
Sample dataframe:
column_1 <Array>           | column_2 <Array>                | column_3 <Array> | join_columns
---------------------------|---------------------------------|------------------|------------------------------------------------
["2345","98576","09857"]   | null                            | ["9857"]         | ["2345","98576","09857","9857"]
null                       | ["87569","9876"]                | ["76586"]        | ["87569","9876","76586"]
["08798","07564"]          | ["12345","5768","89687","7564"] | ["7564"]         | ["08798","07564","12345","5768","89687","7564"]
["03456","09867"]          | ["87586"]                       | []               | ["03456","09867","87586"]
I would like to combine the three columns column_1, column_2 and column_3 into one "join_columns" column and drop the duplicate values.
I used concat; it combined the 3 columns, but only when each column held a single value, perhaps because concat only works on strings:
df.withColumn("join_columns", concat(df.s, df.d)).drop_duplicates()
How can I combine the values of the array columns? Thank you.
Recommended answer
Before Spark 2.4, you can use a udf:
from pyspark.sql.functions import udf

@udf('array<string>')
def array_union(*arr):
    return list(set([e.lstrip('0').zfill(5) for a in arr if isinstance(a, list) for e in a]))

df.withColumn('join_columns', array_union('column_1', 'column_2', 'column_3')).show(truncate=False)
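The udf's core logic can be sanity-checked outside Spark. Below is a plain-Python sketch of the same union step (this `array_union` is a stand-in for illustration, not the Spark udf itself; it returns a sorted list for a deterministic result, whereas the set order inside the udf is arbitrary):

```python
def array_union(*arrays):
    """Merge several (possibly None) lists of digit strings, dropping duplicates.

    Each element is normalized with lstrip('0').zfill(5), so variants such as
    "9857" and "09857" collapse to a single 5-digit key.
    """
    merged = set()
    for arr in arrays:
        if isinstance(arr, list):  # a null column arrives as None and is skipped
            for e in arr:
                merged.add(e.lstrip('0').zfill(5))
    return sorted(merged)

# First row of the sample dataframe: column_2 is null
print(array_union(["2345", "98576", "09857"], None, ["9857"]))
# → ['02345', '09857', '98576']
```

Note that because of the normalization, "2345" comes back as "02345" and "9857"/"09857" merge into one entry.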
Note: we use e.lstrip('0').zfill(5) so that for each array item we first strip the leading 0s and then pad with 0s on the left if the length of the string is less than 5.
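The normalization can be seen in isolation in plain Python, no Spark needed:

```python
# Leading zeros are stripped, then the string is left-padded back to 5 digits,
# so both spellings map to the same key:
print("9857".lstrip("0").zfill(5))    # → 09857
print("09857".lstrip("0").zfill(5))   # → 09857
print("123456".lstrip("0").zfill(5))  # → 123456 (zfill leaves longer strings unchanged)
```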