How to use Column.isin with array column in join?


Problem description

case class Foo1(codes:Seq[String], name:String)
case class Foo2(code:String, description:String)

val ds1 = Seq(
  Foo1(Seq("A"),           "foo1"),
  Foo1(Seq("A", "B"),      "foo2"),
  Foo1(Seq("B", "C", "D"), "foo3"),
  Foo1(Seq("C"),           "foo4"),
  Foo1(Seq("C", "D"),      "foo5")
).toDS

val ds2 = Seq(
  Foo2("A", "product A"),
  Foo2("B", "product B"),
  Foo2("C", "product C"),
  Foo2("D", "product D"),
  Foo2("E", "product E")
).toDS

val j = ds1.join(ds2, ds2("code") isin (ds1("codes")))


Hopefully this Scala code fragment makes it clear what I'm trying to accomplish: our data is structured so that one data set has a column containing an array of values, and I wish to join the values within that collection to another data set. For example, Seq("A", "B") in ds1 would join with "A" and "B" in ds2.


The "isin" operator on Column seems to be exactly what I want, and this compiles, but at runtime I get the following error:


org.apache.spark.sql.AnalysisException: cannot resolve '(code IN (codes))' due to data type mismatch: Arguments must be same type;;


Reading further, I see that isin() takes varargs ("splatted args") and seems more suitable for a filter(). So my question is: is this the intended use of this operator, or is there some other way to perform this type of join?

Recommended answer

Use array_contains:

ds1.crossJoin(ds2).where("array_contains(codes, code)").show

+---------+----+----+-----------+
|    codes|name|code|description|
+---------+----+----+-----------+
|      [A]|foo1|   A|  product A|
|   [A, B]|foo2|   A|  product A|
|   [A, B]|foo2|   B|  product B|
|[B, C, D]|foo3|   B|  product B|
|[B, C, D]|foo3|   C|  product C|
|[B, C, D]|foo3|   D|  product D|
|      [C]|foo4|   C|  product C|
|   [C, D]|foo5|   C|  product C|
|   [C, D]|foo5|   D|  product D|
+---------+----+----+-----------+


If you use Spark 1.x or 2.0, replace crossJoin with a standard join and, if necessary, enable cross joins in the configuration (spark.sql.crossJoin.enabled).
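To make the join semantics concrete, here is a minimal plain-Scala sketch (no Spark required) of what crossJoin plus array_contains computes: the cross product of the two data sets, filtered by array membership. The case classes mirror the question's; the sample rows are a reduced subset.

```scala
case class Foo1(codes: Seq[String], name: String)
case class Foo2(code: String, description: String)

val ds1 = Seq(
  Foo1(Seq("A"), "foo1"),
  Foo1(Seq("A", "B"), "foo2"),
  Foo1(Seq("B", "C"), "foo3")
)
val ds2 = Seq(Foo2("A", "product A"), Foo2("B", "product B"))

// Cross product filtered by membership -- the relational shape behind
// crossJoin(...).where("array_contains(codes, code)")
val joined = for {
  f1 <- ds1
  f2 <- ds2
  if f1.codes.contains(f2.code)
} yield (f1.name, f2.code, f2.description)
```

In Spark itself the same predicate can also be passed directly as the join condition, e.g. `ds1.join(ds2, expr("array_contains(codes, code)"))`, which avoids spelling out the crossJoin; since there is no equality key, the planner still has to fall back to a nested-loop style join.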


It might be possible to avoid the Cartesian product with explode:

ds1.withColumn("code", explode($"codes")).join(ds2, Seq("code")).show

+----+---------+----+-----------+
|code|    codes|name|description|
+----+---------+----+-----------+
|   B|   [A, B]|foo2|  product B|
|   B|[B, C, D]|foo3|  product B|
|   D|[B, C, D]|foo3|  product D|
|   D|   [C, D]|foo5|  product D|
|   C|[B, C, D]|foo3|  product C|
|   C|      [C]|foo4|  product C|
|   C|   [C, D]|foo5|  product C|
|   A|      [A]|foo1|  product A|
|   A|   [A, B]|foo2|  product A|
+----+---------+----+-----------+
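The explode variant can likewise be modeled in plain Scala: explode corresponds to a flatMap over the array column, after which the join is an ordinary equi-join on code. This is a sketch of the semantics (inner-join behavior: codes absent from ds2 simply drop out), not Spark code.

```scala
case class Foo1(codes: Seq[String], name: String)
case class Foo2(code: String, description: String)

val ds1 = Seq(Foo1(Seq("A"), "foo1"), Foo1(Seq("A", "B"), "foo2"))
val ds2 = Seq(Foo2("A", "product A"), Foo2("B", "product B"))

// explode: one output row per element of the array column
val exploded = ds1.flatMap(f1 => f1.codes.map(code => (code, f1.name)))

// equi-join on the exploded key via a lookup map
val byCode = ds2.map(f2 => f2.code -> f2.description).toMap
val joined = exploded.flatMap { case (code, name) =>
  byCode.get(code).map(desc => (code, name, desc))
}
```

Because the explode approach joins on an equality key, Spark can use a hash or sort-merge join here instead of the nested-loop plan the crossJoin version requires.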

