How to filter duplicate records having multiple keys in a Spark DataFrame?

Views: 140 Posted: 2017/3/25 23:42:13 Tags: scala apache-spark dataframe


Problem description


I have two DataFrames. I want to delete the records in Data Frame-A that match records in Data Frame-B on some common column values.

For Example: Data Frame-A:

Data Frame-B:

Keys: A,B,C columns

Desired Output:

A B C D
3 4 5 7
4 7 9 6

Is there any solution for this?

Solution

You are looking for a left anti join, which keeps only the rows of df_a whose key columns (A, B, C) have no match in df_b:

df_a.join(df_b, Seq("A","B","C"), "leftanti").show()
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  3|  4|  5|  7|
|  4|  7|  9|  6|
+---+---+---+---+
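For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same left anti join. The input tables from the question are not available, so the rows in dfA and dfB below are hypothetical, chosen only so that the result matches the desired output. The object name LeftAntiJoinExample, the helper dropMatching, and the extra column E in dfB are likewise assumptions made for illustration. It assumes Spark 2.x or later running in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftAntiJoinExample {

  // Keep only the rows of `left` whose values in `keys` have no match in `right`.
  // `dropMatching` is a hypothetical helper name, not part of the Spark API.
  def dropMatching(left: DataFrame, right: DataFrame, keys: Seq[String]): DataFrame =
    left.join(right, keys, "leftanti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("left-anti-join-example")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: the original tables are missing from the question.
    val dfA = Seq(
      (1, 2, 3, 4),
      (3, 4, 5, 7),
      (4, 7, 9, 6)
    ).toDF("A", "B", "C", "D")

    // Only the key columns A, B, C matter for the join; E is an arbitrary extra column.
    val dfB = Seq(
      (1, 2, 3, 8)
    ).toDF("A", "B", "C", "E")

    // The row (1, 2, 3, 4) matches dfB on A, B, C and is dropped;
    // the other two rows of dfA remain, with dfA's columns only.
    dropMatching(dfA, dfB, Seq("A", "B", "C")).show()

    spark.stop()
  }
}
```

A left anti join returns only columns from the left side, so nothing from df_b leaks into the result. Row order in the output is not guaranteed; add an explicit orderBy if it matters.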
