Get the distinct elements of an ArrayType column in a Spark DataFrame

Problem description

I have a DataFrame with three columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements in each feature column, so the output would be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

Recommended answer

You can use collect_set to find the distinct values of each column after applying the explode function to that column, which unnests the array elements in each cell into separate rows. Suppose your DataFrame is called df:

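import org.apache.spark.sql.functions._

// explode turns each array into one row per element;
// collect_set then gathers the distinct values of each column.
// Note: explode emits no row for an empty or null array.
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"), 
                         collect_set("feat2").alias("distinct_feat2"))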

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
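
One caveat with chaining two explode calls: explode emits no rows for an empty or null array, so a row whose feat2 is truly an empty array (row 1 in the question's data) would be dropped before the aggregation, taking its feat1 values with it. The empty string in the output above suggests the answer's test data held an empty string there rather than an empty array. The following is a minimal sketch of an alternative that avoids explode, assuming Spark 2.4 or later (where flatten and array_distinct are available). The SparkSession setup, the sample data, and the name distinctDf are assumptions for illustration, written in spark-shell style; sort_array is included only to make the element order deterministic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_distinct, col, collect_list, flatten, sort_array}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .appName("distinct-array-elements")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Sample data mirroring the question; row 1 has a genuinely empty feat2 array.
val df = Seq(
  (1, Seq("feat1_1", "feat1_2", "feat1_3"), Seq.empty[String]),
  (2, Seq("feat1_2"), Seq("feat2_1", "feat2_2")),
  (3, Seq("feat1_4"), Seq("feat2_3"))
).toDF("id", "feat1", "feat2")

// collect_list gathers every row's array into an array of arrays,
// flatten merges them into one array, array_distinct removes duplicates,
// and sort_array orders the result (optional, for deterministic output).
val distinctDf = df.agg(
  sort_array(array_distinct(flatten(collect_list(col("feat1"))))).alias("distinct_feat1"),
  sort_array(array_distinct(flatten(collect_list(col("feat2"))))).alias("distinct_feat2")
)

distinctDf.show(false)
```

Because each column is aggregated independently, an empty array in one column does not affect the other, and no intermediate cross product of elements is built. On Spark versions before 2.4, a comparable result could be obtained by exploding and aggregating each column in a separate query.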
