Filter Spark DataFrame based on another DataFrame that specifies blacklist criteria

Question
I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to filter out all the rows from the largeDataFrame whenever the some_identifier column in the largeDataFrame matches one of the rows in the smallDataFrame.
Here is an example:
largeDataFrame
some_identifier,first_name
111,bob
123,phil
222,mary
456,sue
smallDataFrame
some_identifier
123
456
desiredOutput
111,bob
222,mary
Here is my ugly solution.
import org.apache.spark.sql.functions.{broadcast, lit}

// Tag every blacklisted row, left-join it in, then keep only the unmatched rows.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame2), Seq("some_identifier"), "left")
  .filter($"is_bad".isNull)
  .drop("is_bad")
Is there a cleaner solution?
Answer
I'd definitely use a leftanti join in this case:
largeDataFrame
.join(smallDataFrame, Seq("some_identifier"),"leftanti")
.show
// +---------------+----------+
// |some_identifier|first_name|
// +---------------+----------+
// | 222| mary|
// | 111| bob|
// +---------------+----------+
The left anti join is the opposite of a left semi join: put simply, it keeps only the rows from the left table whose key has no match in the right table.
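To make that semantics concrete, here is a minimal sketch using plain Scala collections (no Spark required). The data mirrors the example above; the blacklist plays the role of smallDataFrame:

```scala
// Rows of largeDataFrame as (some_identifier, first_name) pairs.
val large = Seq((111, "bob"), (123, "phil"), (222, "mary"), (456, "sue"))

// Identifiers appearing in smallDataFrame, i.e. the blacklist.
val blacklist = Set(123, 456)

// "leftanti" semantics: drop every row whose key is found on the right side.
val kept = large.filterNot { case (id, _) => blacklist.contains(id) }
// kept: Seq((111, "bob"), (222, "mary"))
```

In Spark the same thing happens distributed across partitions, and with a table this small the join side is typically broadcast to every executor.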