How to find exact and non-exact matches between two dataframes?


Problem description

I have two dataframes.

df1

+---+------+-----+
|id |amount|fee  |
+---+------+-----+
|1  |10.00 |5.0  |
|2  |20.0  |3.0  |
|3  |90    |130.0|
|4  |120.0 |35.0 |
+---+------+-----+

df2

+----+--------+-----+
|exId|exAmount|exFee|
+----+--------+-----+
|1   |10.00   |5.0  |
|2   |20.0    |3.0  |
|3   |20.0    |3.0  |
|4   |120.0   |3.0  |
+----+--------+-----+

I need to perform the following operations:

  1. Find common rows in which all three columns match, e.g. ids 1 and 2 in the example above.
  2. Find common rows in which (id, exId) match but the other columns do not, e.g. ids 3 and 4 in the example above. It would be useful to identify which columns did not match.

So the output would look like this:

Exact match

+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1  |10.00 |5.0|1   |10.00   |5.0  |
|2  |20.0  |3.0|2   |20.00   |3.0  |
+---+------+---+----+--------+-----+

Non-exact match

+---+------+-----+----+--------+-----+--------------+
|id |amount|fee  |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3  |90.00 |130.0|3   |20.00   |3.0  |[fee, amount] |
|4  |120.0 |35.0 |4   |120.00  |3.0  |[fee]         |
+---+------+-----+----+--------+-----+--------------+

Any thoughts?

Solution

For the first task (common rows in which all three columns match, e.g. ids 1 and 2 in the example above), it is enough to check all three column pairs for equality in the join condition:

df1.join(df2, df1("id") === df2("exId") && df1("amount") === df2("exAmount") && df1("fee") === df2("exFee")).show(false)

which should give you

+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1  |10.00 |5.0|1   |10.00   |5.0  |
|2  |20.0  |3.0|2   |20.0    |3.0  |
+---+------+---+----+--------+-----+
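
To try the join above end to end, the following is a minimal sketch meant for spark-shell or a Scala script with Spark on the classpath. It rebuilds the sample data from the question; the local `SparkSession` settings and the choice of `Double` for the amount and fee columns are assumptions, since the original post does not state the column types. If the real columns are strings, values such as "20.0" and "20.00" would not compare equal without a cast.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("exact-match-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data from the question, with amount and fee ASSUMED to be Double.
val df1 = Seq((1, 10.0, 5.0), (2, 20.0, 3.0), (3, 90.0, 130.0), (4, 120.0, 35.0))
  .toDF("id", "amount", "fee")
val df2 = Seq((1, 10.0, 5.0), (2, 20.0, 3.0), (3, 20.0, 3.0), (4, 120.0, 3.0))
  .toDF("exId", "exAmount", "exFee")

// Inner join that keeps only rows where all three column pairs are equal.
val exact = df1.join(
  df2,
  df1("id") === df2("exId") &&
    df1("amount") === df2("exAmount") &&
    df1("fee") === df2("exFee"))

exact.show(false)
```

Because this is an inner join, ids that appear in only one of the two dataframes are dropped as well, which matches the "common rows" wording of the question.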

For the second task (common rows in which (id, exId) match but the other columns do not, e.g. ids 3 and 4 in the example above), join on equality of the first column and inequality of at least one of the remaining two columns, then use when conditions to build the last column:

import org.apache.spark.sql.functions._

df1.join(df2, df1("id") === df2("exId") && (df1("amount") =!= df2("exAmount") || df1("fee") =!= df2("exFee")))
  .withColumn("mismatchFields",
    when(col("amount") =!= col("exAmount") && col("fee") =!= col("exFee"), array(lit("amount"), lit("fee")))
      .when(col("amount") === col("exAmount") && col("fee") =!= col("exFee"), array(lit("fee")))
      .otherwise(array(lit("amount"))))
  .show(false)

which should give you

+---+------+-----+----+--------+-----+--------------+
|id |amount|fee  |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3  |90    |130.0|3   |20.0    |3.0  |[amount, fee] |
|4  |120.0 |35.0 |4   |120.0   |3.0  |[fee]         |
+---+------+-----+----+--------+-----+--------------+
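
The nested when conditions above grow quickly once more than two columns are compared. As a hedged alternative, not part of the original answer, the sketch below joins on the key only and derives `mismatchFields` from a list of column pairs, then splits the result into exact and non-exact matches. It assumes Spark 3.0 or later (for `functions.filter` on array columns) and `Double` value columns; `comparedColumns`, `mismatchMarkers`, `exactMatches`, and `nonExactMatches` are illustrative names.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{array, col, filter, lit, size, when}

// Local session for experimentation; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("mismatch-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data from the question, with amount and fee ASSUMED to be Double.
val df1 = Seq((1, 10.0, 5.0), (2, 20.0, 3.0), (3, 90.0, 130.0), (4, 120.0, 35.0))
  .toDF("id", "amount", "fee")
val df2 = Seq((1, 10.0, 5.0), (2, 20.0, 3.0), (3, 20.0, 3.0), (4, 120.0, 3.0))
  .toDF("exId", "exAmount", "exFee")

// Hypothetical pairing of each df1 column with its df2 counterpart.
val comparedColumns = Seq("amount" -> "exAmount", "fee" -> "exFee")

// For each pair, emit the df1 column name when the values differ, otherwise null.
// `<=>` is null-safe equality, so a null on only one side counts as a mismatch.
val mismatchMarkers: Seq[Column] = comparedColumns.map { case (left, right) =>
  when(!(col(left) <=> col(right)), lit(left))
}

// Join on the key only, then drop the null markers to keep the mismatching names.
val compared = df1
  .join(df2, df1("id") === df2("exId"))
  .withColumn(
    "mismatchFields",
    filter(array(mismatchMarkers: _*), (c: Column) => c.isNotNull))

// An empty array means every compared column matched.
val exactMatches = compared.filter(size(col("mismatchFields")) === 0).drop("mismatchFields")
val nonExactMatches = compared.filter(size(col("mismatchFields")) > 0)

exactMatches.show(false)
nonExactMatches.show(false)
```

With this layout, adding another column to the comparison only requires a new entry in `comparedColumns`, and both result sets come from a single join.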

I hope the answer is helpful
