Spark Dataframe from all combinations of Array column


Question

Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and columns value_1 and value_2 that each contain an integer value. For example, with k = 3:

d1 = 
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+

I need to create a new column combinations that contains, for each pair of sets elements_1 and elements_2, a list of all the sets that can be formed by combining their elements. These sets must have the following properties:

  1. Their size must be k + 1
  2. They must contain either the set in elements_1 or the set in elements_2

For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1), and (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because its length is not 3 + 1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.

Answer

You need to create a custom function to perform the transformation, wrap it in a Spark-compatible UserDefinedFunction, and then apply it using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.

Here's a first shot at the set logic; let me know if it does what you're looking for:

// For each element of a that is missing from b, add it to b (and vice
// versa), yielding every set of size k+1 that contains a or b.
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] = 
    a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
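For the (1, 2, 3) / (3, 4, 5) pair from the question, a quick check of combo in a plain Scala REPL (no Spark needed) produces exactly the four expected sets:

```scala
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)

val combos = combo(Set(1, 2, 3), Set(3, 4, 5))
// Four sets of size 4, each containing one of the original sets:
// Set(3,4,5,1), Set(3,4,5,2), Set(1,2,3,4), Set(1,2,3,5)
```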

Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle that. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:

import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] = 
    (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray
val comboUDF = udf(comboWrap)

Finally, apply it to the DataFrame by creating a new column:

import spark.implicits._                    // for toDF (spark is the SparkSession)
import org.apache.spark.sql.functions.col

val data = Seq((Set(1, 2, 3), Set(3, 4, 5))).toDF("elements_1", "elements_2")
val result = data.withColumn("result", 
    comboUDF(col("elements_1"), col("elements_2")))
result.show
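As a Spark-free sanity check of the whole pipeline, comboWrap can be exercised directly on WrappedArrays (this sketch assumes Scala 2.12, where WrappedArray.make is available):

```scala
import scala.collection.mutable.WrappedArray

def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray

// Same inputs as the DataFrame row above
val out = comboWrap(WrappedArray.make(Array(1, 2, 3)),
                    WrappedArray.make(Array(3, 4, 5)))
// out is an Array of four Array[Int], each of length 4
```

This is what the UDF computes per row before Spark serializes the nested arrays into the result column.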
