How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

Question
I have a DataFrame A that contains a column of arrays of strings.
...
|-- browse: array (nullable = true)
| |-- element: string (containsNull = true)
...
For example, three sample rows are:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|
+---------+--------+---------+
And another DataFrame B that contains a column of strings:
|-- browsenodeid: string (nullable = true)
Some of its sample rows are:
+------------+
|browsenodeid|
+------------+
|           A|
|           Z|
|           M|
+------------+
How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above examples the result will be:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|  <- because Z is a value of B.browsenodeid
|     foo3|     [M]|     bar3|  <- because M is a value of B.browsenodeid
+---------+--------+---------+
If I had a single value, then I would use something like
A.filter(array_contains(A("browse"), single_value))
But what do I do with a list or DataFrame of values?
Answer
I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.
Assuming you have a DataFrame dataDF:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
and an array b containing the values you want to match in browse:
val b: Array[String] = Array("M", "Z")
Implement the UDF:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray

// Returns a UDF that is true for rows whose array column shares
// at least one element with the given values s
def array_contains_any(s: Seq[String]): UserDefinedFunction =
  udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)
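The heart of the UDF is the intersect check. Stripped of Spark, the per-row predicate behaves like this (a plain-Scala sketch, with the sample rows from above hard-coded for illustration):

```scala
// Per-row predicate used inside the UDF: true when the row's array
// shares at least one element with the match values.
def containsAny(row: Seq[String], wanted: Seq[String]): Boolean =
  row.intersect(wanted).nonEmpty

// Mirrors the sample rows: only [X,Y,Z] (via Z) and [M] (via M) pass.
val rows = Seq(Seq("X", "Y", "Z"), Seq("K", "L"), Seq("M"))
val kept = rows.filter(containsAny(_, Seq("M", "Z")))
// kept == Seq(Seq("X", "Y", "Z"), Seq("M"))
```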
and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering like:
dataDF.where(array_contains_any(b)($"browse"))
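As a side note (not part of the original answer): on Spark 2.4 or later, the built-in arrays_overlap function does the same job without a custom UDF. A minimal sketch, assuming a local SparkSession and the sample data from above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_overlap, typedLit}

val spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val dataDF = Seq(
  ("foo1", Seq("X", "Y", "Z"), "bar1"),
  ("foo2", Seq("K", "L"), "bar2"),
  ("foo3", Seq("M"), "bar3")
).toDF("column 1", "browse", "column n")

val b = Array("M", "Z")

// arrays_overlap is true when the two arrays share at least one element,
// so this keeps exactly the rows the UDF approach keeps (foo1 and foo3).
val kept = dataDF.where(arrays_overlap($"browse", typedLit(b)))
kept.show()
```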