Spark Dataframe from all combinations of Array column
Question
Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and columns value_1 and value_2 that each contain an integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
|  (1, 4, 3) |  (3, 4, 5) |
|  (2, 1, 3) |  (1, 0, 2) |
|  (4, 3, 1) |  (3, 5, 6) |
+------------+------------+
I need to create a new column combinations that contains, for each pair of sets elements_1 and elements_2, a list of the sets formed from all possible combinations of their elements. These sets must have the following properties:
- Their size must be k+1
- They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1), (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because its length is not 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.
Answer
You need to create a custom function to perform the transformation, create a Spark-compatible UserDefinedFunction from it, then apply it using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.
Here's a first shot at the set logic; let me know if it does what you're looking for:
// For each element of a that is missing from b, add it to b, and vice versa.
// Every resulting set has size k+1 and contains one of the original sets.
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
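A quick check against the example from the question (Set has no guaranteed iteration order, so the four sets may print in any order):

combo(Set(1, 2, 3), Set(3, 4, 5))
// => Set(Set(3, 4, 5, 1), Set(3, 4, 5, 2), Set(1, 2, 3, 4), Set(1, 2, 3, 5))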
Now create the UDF wrapper. Note that under the hood these sets are all represented as WrappedArrays, so we need to handle that. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray

val comboUDF = udf(comboWrap)
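As a side note, since WrappedArray is a Seq, you can usually avoid naming it by typing the UDF parameters as Seq[Int]. A sketch of this alternative (untested; assumes a reasonably recent Spark version where the udf helper accepts Seq parameters and a Seq[Seq[Int]] return type):

// Alternative wrapper: WrappedArray satisfies Seq[Int], so this avoids
// mentioning the concrete runtime type at all.
val comboUDF = udf { (x: Seq[Int], y: Seq[Int]) =>
  combo(x.toSet, y.toSet).map(_.toSeq).toSeq
}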
Finally, apply it to the DataFrame by creating a new column:
// Assumes `spark` is your SparkSession; its implicits provide .toDF on a Seq.
import spark.implicits._

val data = Seq((Set(1, 2, 3), Set(3, 4, 5))).toDF("elements_1", "elements_2")
val result = data.withColumn("result",
  comboUDF(col("elements_1"), col("elements_2")))
result.show
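For the single example row, the result column should contain the four (k+1)-sized combinations derived above (inner ordering may vary, since Set has no guaranteed iteration order):

// result for ([1, 2, 3], [3, 4, 5]):
// [[3, 4, 5, 1], [3, 4, 5, 2], [1, 2, 3, 4], [1, 2, 3, 5]]
// Tip: result.show(false) prints the arrays without truncation.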