Concatenate array pyspark
Problem description
I have a pyspark DataFrame (Spark version < 2.4).
Sample dataframe:
column_1 <Array>           | column_2 <Array>                | column_3 <Array> | join_columns
---------------------------|---------------------------------|------------------|------------------------------------------------
["2345","98576","09857"]   | null                            | ["9857"]         | ["2345","98576","09857","9857"]
null                       | ["87569","9876"]                | ["76586"]        | ["87569","9876","76586"]
["08798","07564"]          | ["12345","5768","89687","7564"] | ["7564"]         | ["08798","07564","12345","5768","89687","7564"]
["03456","09867"]          | ["87586"]                       | []               | ["03456","09867","87586"]
I would like to combine the three columns column_1, column_2 and column_3 into one "join_columns" column and drop the duplicate values.
I used concat; it combined the 3 columns, but only when each column held a single value, perhaps because concat only works on strings:
df.withColumn("join_columns", concat(df.s, df.d)).drop_duplicates()
How can I combine the values of the array columns? Thank you.
Recommended answer
Before Spark 2.4, you can use a udf:
from pyspark.sql.functions import udf

@udf('array<string>')
def array_union(*arr):
    return list(set([e.lstrip('0').zfill(5) for a in arr if isinstance(a, list) for e in a]))

df.withColumn('join_columns', array_union('column_1', 'column_2', 'column_3')).show(truncate=False)
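The udf's core logic can be sanity-checked outside Spark. Below is a plain-Python sketch of the same union step (this `array_union` is a stand-in for illustration, not the Spark udf itself; it returns a sorted list for a deterministic result, whereas the set order inside the udf is arbitrary):

```python
def array_union(*arrays):
    """Merge several (possibly None) lists of digit strings, dropping duplicates.

    Each element is normalized with lstrip('0').zfill(5), so variants such as
    "9857" and "09857" collapse to a single 5-digit key.
    """
    merged = set()
    for arr in arrays:
        if isinstance(arr, list):  # a null column arrives as None and is skipped
            for e in arr:
                merged.add(e.lstrip('0').zfill(5))
    return sorted(merged)

# First row of the sample dataframe: column_2 is null
print(array_union(["2345", "98576", "09857"], None, ["9857"]))
# → ['02345', '09857', '98576']
```

Note that because of the normalization, "2345" comes back as "02345" and "9857"/"09857" merge into one entry.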
Note: we use e.lstrip('0').zfill(5) so that for each array item we first strip the leading 0s and then pad with 0s on the left if the length of the string is less than 5.
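The normalization can be seen in isolation in plain Python, no Spark needed:

```python
# Leading zeros are stripped, then the string is left-padded back to 5 digits,
# so both spellings map to the same key:
print("9857".lstrip("0").zfill(5))    # → 09857
print("09857".lstrip("0").zfill(5))   # → 09857
print("123456".lstrip("0").zfill(5))  # → 123456 (zfill leaves longer strings unchanged)
```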