Spark filter DataFrames based on common values


Question

I have DF1 and DF2. The first has a column "new_id", the second has a column "db_id".

I need to filter out all rows from the first DataFrame where the value of new_id is not in db_id.

val new_id = Seq(1, 2, 3, 4)
val db_id = Seq(1, 4, 5, 6, 10)

Then I need the rows with new_id == 1 and 4 to stay in df1, and to delete the rows with new_id == 2 and 3, since 2 and 3 are not in db_id.
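On plain Scala collections, the intended filter is just a membership test — a sketch of the desired semantics, not Spark code:

```scala
val new_id = Seq(1, 2, 3, 4)
val db_id = Seq(1, 4, 5, 6, 10)

// Keep only the ids that also appear in db_id
val kept = new_id.filter(db_id.contains)
// kept == Seq(1, 4)
```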

There are a ton of questions about DataFrames here. I might have missed this one. Sorry if it is a duplicate.

P.S. I am using Scala, if that matters.

Answer

What you need is a left semi join:

import spark.implicits._

val DF1 = Seq(1, 3).toDF("new_id")
val DF2 = Seq(1, 2).toDF("db_id")

DF1.as("df1").join(DF2.as("df2"), $"df1.new_id" === $"df2.db_id", "leftsemi")
  .show()

+------+
|new_id|
+------+
|     1|
+------+
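A "leftsemi" join keeps only the left-side rows that have a match; Spark's complementary join type, "leftanti", keeps the rows without a match. The two outcomes can be sketched on plain collections (not Spark code, just the semantics):

```scala
val new_id = Seq(1, 2, 3, 4)
val db_id = Seq(1, 4, 5, 6, 10)

// partition splits into (matching, non-matching), mirroring leftsemi vs leftanti
val (semi, anti) = new_id.partition(db_id.contains)
// semi == Seq(1, 4), anti == Seq(2, 3)
```

For a small, static list of ids, Spark's `Column.isin` (e.g. `DF1.filter($"new_id".isin(db_id: _*))`) achieves the same semi-filter without a join.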
