How to use Column.isin with array column in join?
Question
case class Foo1(codes:Seq[String], name:String)
case class Foo2(code:String, description:String)
val ds1 = Seq(
Foo1(Seq("A"), "foo1"),
Foo1(Seq("A", "B"), "foo2"),
Foo1(Seq("B", "C", "D"), "foo3"),
Foo1(Seq("C"), "foo4"),
Foo1(Seq("C", "D"), "foo5")
).toDS
val ds2 = Seq(
Foo2("A", "product A"),
Foo2("B", "product B"),
Foo2("C", "product C"),
Foo2("D", "product D"),
Foo2("E", "product E")
).toDS
val j = ds1.join(ds2, ds2("code") isin (ds1("codes")))
Hopefully this Scala code fragment makes it clear what I'm trying to accomplish: our data is structured so that one data set has a column which contains an array of values, and I wish to join the values within that collection to another data set. So for example Seq("A", "B") in ds1 would join with "A" and "B" in ds2.
The "isin" operator on Column seems to be exactly what I want, and this builds, but when I run it I get the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '(code IN (codes))' due to data type mismatch: Arguments must be same type;;
Reading further I see that isin() wants to take varargs ("splatted args") and seems more suitable for a filter(). So my question is, is this the intended use of this operator, or is there some other way to perform this type of join?
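For context, isin is indeed a varargs method that compares a column against a fixed set of literal values, which is why it fits naturally inside a filter rather than a join condition. A minimal sketch against the ds2 defined above:

```scala
// isin tests a column against a set of literals, e.g. as a filter
// predicate; it cannot reference another column's array values.
val filtered = ds2.filter($"code".isin("A", "B"))
// keeps only the rows whose code is "A" or "B"
```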
Answer
Use array_contains:
ds1.crossJoin(ds2).where("array_contains(codes, code)").show
+---------+----+----+-----------+
| codes|name|code|description|
+---------+----+----+-----------+
| [A]|foo1| A| product A|
| [A, B]|foo2| A| product A|
| [A, B]|foo2| B| product B|
|[B, C, D]|foo3| B| product B|
|[B, C, D]|foo3| C| product C|
|[B, C, D]|foo3| D| product D|
| [C]|foo4| C| product C|
| [C, D]|foo5| C| product C|
| [C, D]|foo5| D| product D|
+---------+----+----+-----------+
If you use Spark 1.x or 2.0, replace crossJoin with a standard join and, if necessary, enable cross joins in the configuration.
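On Spark 2.0 the relevant switch is spark.sql.crossJoin.enabled, which is off by default. A sketch of enabling it, assuming an active SparkSession named spark:

```scala
// Allow plans that contain an implicit cross join; without this,
// Spark 2.0 rejects joins that lack an equi-join condition.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```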
It might be possible to avoid the Cartesian product with explode:
ds1.withColumn("code", explode($"codes")).join(ds2, Seq("code")).show
+----+---------+----+-----------+
|code| codes|name|description|
+----+---------+----+-----------+
| B| [A, B]|foo2| product B|
| B|[B, C, D]|foo3| product B|
| D|[B, C, D]|foo3| product D|
| D| [C, D]|foo5| product D|
| C|[B, C, D]|foo3| product C|
| C| [C]|foo4| product C|
| C| [C, D]|foo5| product C|
| A| [A]|foo1| product A|
| A| [A, B]|foo2| product A|
+----+---------+----+-----------+
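The array_contains predicate can also be passed directly as the join condition, which reads closer to the original isin attempt. A sketch; note that Spark still executes a non-equi join like this as a nested-loop/cross join under the hood, so the same performance caveat applies:

```scala
import org.apache.spark.sql.functions.expr

// Non-equi join: each row of ds1 is matched against every row of ds2
// whose code appears in ds1's codes array.
val j = ds1.join(ds2, expr("array_contains(codes, code)"))
```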