Filter dataframe by value NOT present in column of other dataframe


Question

Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of another dataframe.

I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains or the "isin" keyword, or one of the join methods.

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried

Does anyone have any ideas?

Recommended answer

I've discovered that I can solve this using a simpler method - it seems that an antijoin is possible as a parameter to the join method, but the Spark Scaladoc does not describe it:

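// toDF needs the session implicits. spark-shell imports them automatically;
// elsewhere, import them from your SparkSession (assumed here to be named `spark`).
import spark.implicits._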

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

df1.join(df2, df1("city") === df2("cities"), "leftanti").show

Result:

+----------+-------+ 
|  location|   city| 
+----------+-------+ 
|Chittagong|Chennai| 
+----------+-------+  
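
Since the question also asks about "isin", here is a minimal sketch of that route next to the anti-join, for comparison. It is not part of the original answer. It assumes the lookup dataframe is small enough to collect to the driver, and the names `knownCities`, `viaIsin` and `viaAntiJoin` are made up for illustration. The session setup is only there so the snippet can run outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("anti-filter-sketch")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq("London", "Amsterdam", "New York").toDF("cities")

// Pull the lookup values to the driver. Only sensible when df2 is small.
val knownCities: Seq[String] = df2.select("cities").as[String].collect().toSeq

// Keep the rows of df1 whose city is NOT among the collected values.
val viaIsin = df1.filter(!$"city".isin(knownCities: _*))

// The anti-join from the answer, which stays distributed and avoids the collect.
val viaAntiJoin = df1.join(df2, df1("city") === df2("cities"), "leftanti")

viaIsin.show()
viaAntiJoin.show()
```

On this sample data both calls should show the same single Chittagong/Chennai row as above. The anti-join is usually the safer choice when the second dataframe is large, because nothing has to be gathered on the driver.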

P.S. thanks for the pointer to the duplicate - duly marked as such
