Pairwise comparison on DataFrame elements


Question

How can I iterate over columns pairwise to find similarities?

Every element from every column of one DataFrame is to be compared with every element from every column of another DataFrame.

For example:

df1 has two fields, Name and Age:

Name, Age
"Ajay Malhotra", 28
"Sujata Krishanan", 27
"Madhav Shankar", 33

df2 has three fields, UserID, Emp ID, and Email:

"UserID", "Emp ID", "Email"
--------------------------------------
"Ajay.Malhotra", 100, "a.malt@nothing.com"
"Madhav.Shankar", 101, "m.shankar"
"Sujata.Kris", 1001, "Kris.Suja@nothing.com"


There is some method that returns a match value. For the example it can simply hard-code 0.73:

def chekIfSame(leftString: String, rightString: String): Double = {
  // Some logic that yields a match value; hard-coded here
  0.73
}
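The question leaves the matching logic open. As one possible way to fill it in, here is a minimal sketch that scores two strings by normalized Levenshtein edit distance. The choice of Levenshtein, the lower-casing step, and the `Similarity` object and `levenshtein` helper are assumptions made for illustration. They are not part of the original post, and the scores this produces will not reproduce the illustrative values shown in the question.

```scala
object Similarity {
  // Levenshtein edit distance via dynamic programming, keeping only two rows.
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        // Cheapest of insertion, deletion, and substitution
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Hypothetical body for chekIfSame (an assumption, not the asker's logic):
  // 1.0 means identical ignoring case, lower values mean more edits are needed.
  def chekIfSame(leftString: String, rightString: String): Double = {
    val l = leftString.toLowerCase
    val r = rightString.toLowerCase
    val longest = math.max(l.length, r.length)
    if (longest == 0) 1.0 else 1.0 - levenshtein(l, r).toDouble / longest
  }
}
```

With this sketch, "Ajay Malhotra" and "Ajay.Malhotra" differ by a single character out of 13, so the score is 1 - 1/13, roughly 0.92.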

How can I take each column from df1 and each column from df2 and pass them to chekIfSame?
The output could be a Cartesian product like this:

Name, UserId, MatchValue
--------------------------------------
"Sujata Krishanan", Sujata.Kris, 0.85
"Ajay Malhotra", Ajay.Malhotra, 0.98
"Ajay Malhotra", Sujata.Kris, 0.07

Recommended answer

Nested foreach loops over two DataFrames

We won't be able to nest loops over the two DataFrames. However, we can join them and pass the joined columns to a function:

// crossJoin makes the Cartesian product explicit
val joined = leftDf.crossJoin(rightDf)
val joinedWithScore = joined.withColumn("simlarScore", checkSimilarity(joined(ltColName), joined(rtColName)))

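For this to work, the chekIfSame logic has to be wrapped in a UDF before the operation above, since withColumn operates on Column expressions rather than plain strings:

import org.apache.spark.sql.functions.udf

val checkSimilarity = udf((left: String, right: String) => {
  // Real comparison logic goes here; 0.73 is a hard-coded placeholder
  0.73
})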
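The snippet above scores a single pair of columns (ltColName, rtColName), while the question asks for every column of df1 against every column of df2. One way this could be extended is sketched below: cross join once, then build one projection per column pair and union the results. The `PairwiseColumns` object, the `compareAll` helper, and the output column names are hypothetical names introduced for this sketch. Values are cast to string so that numeric columns such as Age can be passed to the same string-based scorer, which is an assumption about the desired behavior.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}

object PairwiseColumns {
  // Placeholder scorer from the question; real logic would replace 0.73.
  def chekIfSame(leftString: String, rightString: String): Double = 0.73

  private val checkSimilarity = udf((l: String, r: String) => chekIfSame(l, r))

  // Scores every (left column, right column) pair over the cross join.
  // Assumes both DataFrames have at least one column.
  def compareAll(leftDf: DataFrame, rightDf: DataFrame): DataFrame = {
    val joined = leftDf.crossJoin(rightDf)
    val perPair = for {
      lt <- leftDf.columns.toSeq
      rt <- rightDf.columns.toSeq
    } yield joined
      .select(
        lit(lt).as("leftColumn"),
        lit(rt).as("rightColumn"),
        leftDf(lt).cast("string").as("leftValue"),
        rightDf(rt).cast("string").as("rightValue"))
      .withColumn("MatchValue", checkSimilarity(col("leftValue"), col("rightValue")))
    perPair.reduce(_ union _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pairwise").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("Ajay Malhotra", 28), ("Sujata Krishanan", 27), ("Madhav Shankar", 33))
      .toDF("Name", "Age")
    val df2 = Seq(
      ("Ajay.Malhotra", 100, "a.malt@nothing.com"),
      ("Madhav.Shankar", 101, "m.shankar"),
      ("Sujata.Kris", 1001, "Kris.Suja@nothing.com")
    ).toDF("UserID", "Emp ID", "Email")

    compareAll(df1, df2).show(false)
    spark.stop()
  }
}
```

Note that the result grows as rows(df1) × rows(df2) × columns(df1) × columns(df2), so for anything beyond small inputs it is probably worth restricting the column pairs or filtering on MatchValue before collecting.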
