How to find exact and non-exact matches between two dataframes?


Question

I have two dataframes.

df1

+---+------+-----+
|id |amount|fee  |
+---+------+-----+
|1  |10.00 |5.0  |
|2  |20.0  |3.0  |
|3  |90    |130.0|
|4  |120.0 |35.0 |
+---+------+-----+

df2

+----+--------+-----+
|exId|exAmount|exFee|
+----+--------+-----+
|1   |10.00   |5.0  |
|2   |20.0    |3.0  |
|3   |20.0    |3.0  |
|4   |120.0   |3.0  |
+----+--------+-----+

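I need to do the following:

  1. Find the common rows in which all three columns match, e.g. ids 1 and 2 in the example above.
  2. Find the common rows in which (id, exId) match but the other columns do not, e.g. ids 3 and 4 in the example above. It would be useful to also identify which columns did not match.

So the output would look like this:

Exact matches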

+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1  |10.00 |5.0|1   |10.00   |5.0  |
|2  |20.0  |3.0|2   |20.00   |3.0  |
+---+------+---+----+--------+-----+

Non-exact matches

+---+------+-----+----+--------+-----+--------------+
|id |amount|fee  |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3  |90.00 |130.0|3   |20.00   |3.0  |[fee, amount] |
|4  |120.0 |35.0 |4   |120.00  |3.0  |[fee]         |
+---+------+-----+----+--------+-----+--------------+

Any ideas?

Recommended answer

Find common rows in which all three columns match, e.g. ids 1 and 2 in the above example.

This is easy: when joining, just check that all three column pairs are equal.

df1.join(df2, df1("id") === df2("exId") && df1("amount") === df2("exAmount") && df1("fee") === df2("exFee")).show(false)

which should give you

+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1  |10.00 |5.0|1   |10.00   |5.0  |
|2  |20.0  |3.0|2   |20.0    |3.0  |
+---+------+---+----+--------+-----+
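
To try the join above end to end, here is a minimal self-contained sketch. It rests on a few assumptions that are not stated in the thread: a local SparkSession, all columns stored as strings (the outputs print values such as 10.00 and 90 unchanged, which suggests this), and the hypothetical names ExactMatchSketch, sampleData and exactMatches.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExactMatchSketch {
  // Assumption: string columns that mirror the values shown in the question.
  def sampleData(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val df1 = Seq(("1", "10.00", "5.0"), ("2", "20.0", "3.0"),
                  ("3", "90", "130.0"), ("4", "120.0", "35.0"))
      .toDF("id", "amount", "fee")
    val df2 = Seq(("1", "10.00", "5.0"), ("2", "20.0", "3.0"),
                  ("3", "20.0", "3.0"), ("4", "120.0", "3.0"))
      .toDF("exId", "exAmount", "exFee")
    (df1, df2)
  }

  // Inner join that keeps only rows where all three column pairs are equal.
  def exactMatches(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2,
      df1("id") === df2("exId") &&
      df1("amount") === df2("exAmount") &&
      df1("fee") === df2("exFee"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]").appName("exact-match-sketch").getOrCreate()
    val (df1, df2) = sampleData(spark)
    exactMatches(df1, df2).orderBy("id").show(false)
    spark.stop()
  }
}
```

Note that === never evaluates to true when either side is null, so rows with a null in any compared column are dropped from the result. If nulls on both sides should count as a match, the null-safe operator <=> could be used in place of ===.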

Find common rows in which (id, exId) match but the others don't, e.g. ids 3 and 4 in the above example. It would be useful if we identify which of the columns didn't match.

For this, check for equality on the first column and for inequality on at least one of the remaining two columns, then use when/otherwise conditions to build the mismatchFields column.

import org.apache.spark.sql.functions._

// Join on id, keeping only rows where at least one of the other two columns differs
df1.join(df2, df1("id") === df2("exId") && (df1("amount") =!= df2("exAmount") || df1("fee") =!= df2("exFee")))
  .withColumn("mismatchFields",
    when(col("amount") =!= col("exAmount") && col("fee") =!= col("exFee"), array(lit("amount"), lit("fee")))
      .when(col("amount") === col("exAmount") && col("fee") =!= col("exFee"), array(lit("fee")))
      .otherwise(array(lit("amount"))))
  .show(false)

which should give you

+---+------+-----+----+--------+-----+--------------+
|id |amount|fee  |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3  |90    |130.0|3   |20.0    |3.0  |[amount, fee] |
|4  |120.0 |35.0 |4   |120.0   |3.0  |[fee]         |
+---+------+-----+----+--------+-----+--------------+
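
The nested when conditions grow quickly as more columns are compared. One possible generalisation, sketched here and not part of the original answer, joins on id only and builds mismatchFields from a list of column pairs. It assumes Spark 3.0 or later (for the filter function in org.apache.spark.sql.functions) and string columns. The names MismatchSketch and withMismatchFields are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, filter, lit, size, when}

object MismatchSketch {
  // Hypothetical helper: `pairs` maps each df1 column to its df2 counterpart.
  def withMismatchFields(joined: DataFrame, pairs: Seq[(String, String)]): DataFrame = {
    // One entry per pair: the df1 column name when the values differ, null otherwise.
    // !(a <=> b) is a null-safe inequality, so a null on only one side is a mismatch.
    val flags: Seq[Column] = pairs.map { case (left, right) =>
      when(!(col(left) <=> col(right)), lit(left))
    }
    // Drop the nulls so only the names of the mismatching columns remain.
    joined.withColumn("mismatchFields",
      filter(array(flags: _*), (x: Column) => x.isNotNull))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]").appName("mismatch-sketch").getOrCreate()
    import spark.implicits._

    // Assumption: string columns that mirror the values shown in the question.
    val df1 = Seq(("1", "10.00", "5.0"), ("2", "20.0", "3.0"),
                  ("3", "90", "130.0"), ("4", "120.0", "35.0"))
      .toDF("id", "amount", "fee")
    val df2 = Seq(("1", "10.00", "5.0"), ("2", "20.0", "3.0"),
                  ("3", "20.0", "3.0"), ("4", "120.0", "3.0"))
      .toDF("exId", "exAmount", "exFee")

    val compared = withMismatchFields(
      df1.join(df2, df1("id") === df2("exId")),
      Seq("amount" -> "exAmount", "fee" -> "exFee"))

    // An empty array means every pair matched; a non-empty one lists the differences.
    compared.filter(size(col("mismatchFields")) === 0).drop("mismatchFields").show(false)
    compared.filter(size(col("mismatchFields")) > 0).show(false)
    spark.stop()
  }
}
```

With this shape a single join serves both requirements, and comparing another column only means adding one more pair to the list. Unlike the =!= version above, this sketch treats a null on exactly one side as a mismatch.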

I hope the answer is helpful.
