How to filter a Spark DataFrame by an array column containing any of the values of some other DataFrame/set
Question
I have a DataFrame A that contains a column of arrays of strings.
...
|-- browse: array (nullable = true)
| |-- element: string (containsNull = true)
...
For example, three sample rows would be:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
And another DataFrame B that contains a column of strings:
|-- browsenodeid: string (nullable = true)
Some sample rows would be:
+------------+
|browsenodeid|
+------------+
| A|
| Z|
| M|
How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above examples, the result would be:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1| <- because Z is a value of B.browsenodeid
| foo3| [M]| bar3| <- because M is a value of B.browsenodeid
If I had a single value, then I would use something like:
A.filter(array_contains(A("browse"), single_value))
但是如何处理值的列表或DataFrame?
But what do I do with a list or DataFrame of values?
Answer
I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.
Assuming you have a DataFrame dataDF:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
and an array b containing the values you want to match in browse:
val b: Array[String] = Array("M", "Z")
Implement a UDF:

import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def array_contains_any(s: Seq[String]): UserDefinedFunction =
  udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)
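The heart of this UDF is plain Scala collection logic: a row is kept exactly when the intersection of its array and the match values is non-empty. A minimal standalone sketch of that predicate (no Spark required; `overlaps` is just an illustrative name):

```scala
// Overlap predicate used inside the UDF: true when the two
// sequences share at least one element.
def overlaps(c: Seq[String], s: Seq[String]): Boolean =
  c.toList.intersect(s).nonEmpty

// Mirrors the example rows against b = Array("M", "Z"):
println(overlaps(Seq("X", "Y", "Z"), Seq("M", "Z"))) // foo1 row: true, kept
println(overlaps(Seq("K", "L"),      Seq("M", "Z"))) // foo2 row: false, dropped
println(overlaps(Seq("M"),           Seq("M", "Z"))) // foo3 row: true, kept
```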
and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering, like:
dataDF.where(array_contains_any(b)($"browse"))
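Two alternatives worth knowing, sketched below under stated assumptions: Spark 2.4+ ships a built-in `arrays_overlap` function (with `typedLit`, available since 2.2, to lift the local array into a column), which avoids the UDF entirely; and when the match values live in a DataFrame B rather than a local array, an explode plus left-semi join does the job. The names `dataDF`, `b`, and `B` are those from this answer and the question.

```scala
import org.apache.spark.sql.functions.{arrays_overlap, col, explode, typedLit}

// Spark 2.4+: built-in overlap test against a literal array, no UDF needed.
val filtered = dataDF.where(arrays_overlap(col("browse"), typedLit(b.toSeq)))

// When the values live in a DataFrame B instead of a local array:
// explode browse into one value per row, keep A rows with at least one
// match via a left-semi join, then undo the explode. distinct() removes
// the duplicates produced when a row matches more than one value.
val filteredByJoin = dataDF
  .withColumn("bn", explode(col("browse")))
  .join(B, col("bn") === B("browsenodeid"), "left_semi")
  .drop("bn")
  .distinct()
```

The semi-join approach scales to a large B because Spark can broadcast or shuffle the join, whereas the UDF and `arrays_overlap` variants require the match values to fit in driver memory.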