How to filter a Spark DataFrame by an array column containing any of the values of some other DataFrame/set

Views: 83 Published: 2020/9/4 9:06:53 Tags: apache-spark apache-spark-sql spark-dataframe


Question

I have a DataFrame A that contains a column of type array of strings.

...
 |-- browse: array (nullable = true)
 |    |-- element: string (containsNull = true)
...

For example, three sample rows would be:

+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|

And another DataFrame B that contains a column of strings:

|-- browsenodeid: string (nullable = true)

Some sample rows might be:

+------------+
|browsenodeid|
+------------+
|           A|
|           Z|
|           M|

How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above examples, the result would be:

+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1| <- because Z is a value of B.browsenodeid
|     foo3|     [M]|     bar3| <- because M is a value of B.browsenodeid

If I had a single value, I would use something like:

A.filter(array_contains(A("browse"), single_value))

But what do I do with a list or DataFrame of values?

Recommended answer

I found an elegant solution for this that does not require converting DataFrames/Datasets to RDDs.

Assuming you have a DataFrame dataDF:

+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|

and an array b containing the values you want to match in browse:

val b: Array[String] = Array("M", "Z")
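The question starts from a DataFrame B rather than a ready-made array. One way to obtain b from it is sketched below, assuming B is small enough to be collected to the driver. The object name CollectValues and the helper collectBrowseNodeIds are hypothetical names introduced only for this illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CollectValues {
  // Hypothetical helper: collects the string column `browsenodeid` of B
  // into a driver-side array.
  // Assumption: B is small enough to fit in driver memory.
  def collectBrowseNodeIds(spark: SparkSession, bDF: DataFrame): Array[String] = {
    import spark.implicits._ // provides the Encoder[String] needed by .as[String]
    bDF.select("browsenodeid")
      .na.drop()   // skip rows whose id is null
      .distinct()  // each value only needs to be matched once
      .as[String]
      .collect()
  }
}
```

With this, b could be built as `val b = CollectValues.collectBrowseNodeIds(spark, B)`. If B is too large to collect, joining A (with browse exploded) against B is likely a better fit than a driver-side array.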

Implement a udf:

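// needs: org.apache.spark.sql.functions.udf, org.apache.spark.sql.expressions.UserDefinedFunction, scala.collection.mutable.WrappedArray
def array_contains_any(s: Seq[String]): UserDefinedFunction = udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)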

and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering, like this:

dataDF.where(array_contains_any(b)($"browse"))
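For reference, here is a minimal self-contained sketch of the steps above, run against the sample rows from the question on a local SparkSession. Several things in it are assumptions rather than part of the original answer: the object name ArrayContainsAny, the helpers sampleData and overlapsAny, the use of Seq[String] instead of WrappedArray[String] as the UDF argument type (so it is less tied to one Scala version), and the added null check. The overlapsAny variant swaps the UDF for the built-in arrays_overlap function, which is only available from Spark 2.4 onward, and it assumes the list of values is non-empty.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{array, arrays_overlap, col, lit, udf}

object ArrayContainsAny {
  // UDF approach from the answer: true when the array column shares at least
  // one element with s. A null array is treated as "no match" (assumption).
  def array_contains_any(s: Seq[String]): UserDefinedFunction =
    udf((c: Seq[String]) => c != null && c.intersect(s).nonEmpty)

  // Alternative without a UDF (Spark 2.4+): compare the column against a
  // literal array built from the values. Assumes `values` is non-empty.
  def overlapsAny(c: Column, values: Seq[String]): Column =
    arrays_overlap(c, array(values.map(v => lit(v)): _*))

  // The three sample rows from the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("foo1", Seq("X", "Y", "Z"), "bar1"),
      ("foo2", Seq("K", "L"), "bar2"),
      ("foo3", Seq("M"), "bar3")
    ).toDF("column 1", "browse", "column n")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-contains-any")
      .getOrCreate()

    val dataDF = sampleData(spark)
    val b: Array[String] = Array("M", "Z")

    // Both filters are expected to keep the foo1 and foo3 rows.
    dataDF.where(array_contains_any(b.toSeq)(col("browse"))).show()
    dataDF.where(overlapsAny(col("browse"), b.toSeq)).show()

    spark.stop()
  }
}
```

The UDF version works on older Spark releases but is opaque to the query optimizer; where Spark 2.4+ is available, the built-in function is usually the simpler choice.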
