Filter dataframe by value NOT present in column of other dataframe
Question
Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of another dataframe.
I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, Column.contains, the "isin" keyword, or one of the join methods.
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried
Any ideas?
Answer
I've discovered that I can solve this using a simpler method: an anti join can be requested via a parameter to the join method, although the Spark Scaladoc does not describe it:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
df1.join(df2, df1("city") === df2("cities"), "leftanti").show
Result:
+----------+-------+
| location| city|
+----------+-------+
|Chittagong|Chennai|
+----------+-------+
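Since the question also mentions the "isin" keyword, here is a hedged alternative sketch that avoids a join entirely: collect the lookup column to the driver and filter with `isin`. This is only reasonable when df2 is small enough to fit on the driver; the anti join above is the better choice for large data. The local SparkSession setup is an assumption added to make the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session purely for illustration
val spark = SparkSession.builder
  .master("local[*]")
  .appName("isin-filter-demo")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq("London", "Amsterdam", "New York").toDF("cities")

// Collect the small lookup column to the driver, then negate isin.
// Only appropriate when df2 is small: collect() pulls all rows locally.
val cities = df2.as[String].collect()
val res = df1.filter(!$"city".isin(cities: _*))
res.show()  // keeps only rows whose city does not appear in df2
```

The trade-off versus "leftanti": `isin` broadcasts the values as a literal predicate, so it skips the shuffle a join may incur, but it breaks down if the lookup set is large.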
P.S. Thanks for the pointer to the duplicate - duly marked as such.