Filter dataframe by value NOT present in column of other dataframe
Question
Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of another dataframe.
I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains or the "isin" keyword, or one of the join methods.
val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"), ("Amsterdam"), ("New York")).toDF("cities")
val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried
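One variant of the `isin` idea mentioned in the question does work, assuming `df2` is small enough to collect its lookup values to the driver (a sketch, not from the original post; `spark` is assumed to be an active SparkSession):

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes an active SparkSession named `spark`

// Collect the lookup column to the driver, then negate isin.
// Only reasonable when df2 is small.
val cities = df2.select("cities").as[String].collect()
val res = df1.filter(!col("city").isin(cities: _*))
```

This avoids a join entirely, at the cost of shipping `df2`'s values to the driver.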
Any ideas?
Answer
I've discovered that I can solve this with a simpler method: it seems an anti-join can be requested via the join-type parameter of the join method, although the Spark Scaladoc does not describe it:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"), ("Amsterdam"), ("New York")).toDF("cities")
df1.join(df2, df1("city") === df2("cities"), "leftanti").show
Result:
+----------+-------+
| location| city|
+----------+-------+
|Chittagong|Chennai|
+----------+-------+
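The same result can also be reached without the "leftanti" join type, using a plain left outer join and keeping only the unmatched rows (a sketch of an equivalent approach, not part of the original answer):

```scala
// Left outer join, then keep rows where no matching city was found,
// and drop the helper column that came from df2.
df1.join(df2, df1("city") === df2("cities"), "left_outer")
   .filter(df2("cities").isNull)
   .drop("cities")
   .show
```

The "leftanti" form is more direct, but the null-filter version makes the semantics of an anti-join explicit.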
P.S. thanks for the pointer to the duplicate - duly marked as such