String similarity with OR condition in MinHash Spark ML

Question

I have two datasets. The first is a large reference dataset; for each row of the second, I want to find the best match in the first using the MinHash algorithm.

val dataset1 = 
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|   Colello| 94552| 15609|     Jack|Colello|94552|
|     Anatolie|     Trifa| 63011| 51181|   Anatolie|Trifa|63011|
|      Jaromir|      Plch| 51237| 91798|     Jaromir|Plch|51237|
+-------------+----------+------+------+-----------------------+

// very_large
val dataset2 =
+-------------+----------+------+-----------------------+
|            x|         y|     a|      dataString(x+y+a)|
+-------------+----------+------+-----------------------+
|         John|     Smith| 28200|       John|Smith|28200|
|         Emma|   Morales| 17706|     Emma|Morales|17706|
|        Janet|  Alvarado| 98809|   Janet|Alvarado|98809|
|    Elizabeth|   Keatley| 36935|Elizabeth|Keatley|36935|
|     Cristina|      Cruz| 75716|    Cristina|Cruz|75716|
|         Jake|   Colello| 15609|     Jake|Colello|15609|
|     Anatolie|     Trifa| 63011|   Anatolie|Trifa|63011|
|         Rune|      Eide| 41907|        Rune|Eide|41907|
|    Hortensia|   Brumaru| 33836|Hortensia|Brumaru|33836|
|       Adrien|     Payet| 40463|     Adrien|Payet|40463|
|       Ashley|    Howard| 12445|    Ashley|Howard|12445|
|       Pamela|      Dean| 81311|      Pamela|Dean|81311|
|        Laura|     Calvo| 82682|      Laura|Calvo|82682|
|        Flora|   Parghel| 81206|    Flora|Parghel|81206|
|      Jaromír|      Plch| 91798|     Jaromír|Plch|91798|
+-------------+----------+------+-----------------------+

For the string similarity, I created a | (pipe) separated dataString column.

Here is the code that finds the similarity between dataString (x' + y' + a') and dataString (x + y + a); it works fine:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, MinHashLSH, MinHashLSHModel, RegexTokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Split dataString on "|" and turn the tokens into count vectors
val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords")
val vectorizer = new CountVectorizer().setInputCol("dataStringWords").setOutputCol("features")

val pipelineTV = new Pipeline().setStages(Array(tokenizer, vectorizer))
val modelTV = pipelineTV.fit(dataset1)

// MinHashLSH requires at least one non-zero entry per vector, so drop empty ones
val isNoneZeroVector = udf((v: Vector) => v.numNonzeros > 0)

val dataset1_TV = modelTV.transform(dataset1).filter(isNoneZeroVector(col("features")))
val dataset2_TV = modelTV.transform(dataset2).filter(isNoneZeroVector(col("features")))

val lsh = new MinHashLSH().setNumHashTables(20).setInputCol("features").setOutputCol("hashValues")
val pipelineLSH = new Pipeline().setStages(Array(lsh))
val modelLSH = pipelineLSH.fit(dataset1_TV)

val dataset1_LSH = modelLSH.transform(dataset1_TV)
val dataset2_LSH = modelLSH.transform(dataset2_TV)

// Approximate join on Jaccard distance, with 0.5 as the distance threshold
val finalResult = modelLSH.stages.last.asInstanceOf[MinHashLSHModel].approxSimilarityJoin(dataset1_LSH, dataset2_LSH, 0.5)
finalResult.show
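
For reference, the distance that approxSimilarityJoin thresholds here is the Jaccard distance between the two token sets. Below is a minimal plain-Scala sketch of that quantity, without Spark; JaccardSketch, tokens and jaccardDistance are hypothetical names introduced for illustration. It compares raw token sets only, so it ignores that CountVectorizer drops tokens missing from the vocabulary it was fitted on, and the distances Spark reports may differ for that reason.

```scala
object JaccardSketch {
  // Split on the pipe and lowercase, mirroring RegexTokenizer's default lowercasing.
  def tokens(dataString: String): Set[String] =
    dataString.split("\\|").map(_.toLowerCase).filter(_.nonEmpty).toSet

  // Jaccard distance = 1 - |A intersect B| / |A union B|
  def jaccardDistance(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else 1.0 - (a intersect b).size.toDouble / (a union b).size.toDouble

  def main(args: Array[String]): Unit = {
    val reference = tokens("John|Smith|55649")
    // Same names, different number: 2 shared tokens out of 4 distinct ones
    println(jaccardDistance(reference, tokens("John|Smith|28200")))
    // Identical token sets
    println(jaccardDistance(reference, tokens("John|Smith|55649")))
  }
}
```

A pair that agrees on both names but not on the number shares 2 of 4 distinct tokens, which puts it at distance 0.5, the same value as the threshold passed to approxSimilarityJoin.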

As mentioned, the code gives the expected result, but my requirement is to compare a with a' OR b', i.e.:

x' + y' + (a' OR b')
x  + y  + (   a    )

I cannot join these two datasets directly, as they have no common field; any join would be a cross join.

So is there any way to achieve string similarity with an OR condition on grouped data in Apache Spark 2.2.0?

Recommended answer

I don't think it is possible to set two input columns (one dataString column per element used, a' or b') and then apply OR during the computation. You can, however, transform dataset1 so that it represents both the x' + y' + a' and the x' + y' + b' variants, and then run the distance computation. This won't give exactly the same answer as selecting a' or b' based on the corresponding row in dataset2 (I think you know how to do that expensive operation), but it still gives some sense of similarity.

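import org.apache.spark.sql.functions.{array, concat_ws, explode}
import spark.implicits._  // for the $"..." column syntax; assumes a SparkSession named spark

// Emit two rows per reference record, one carrying a' and one carrying b' in column "a",
// then rebuild dataString. Refit pipelineTV on this dataset so b' values enter the vocabulary.
val dataset1splitted =
    dataset1
    .withColumn( "a", explode( array( "a'", "b'" ) ) )
    .drop( "a'", "b'", "dataString" )
    .withColumn( "dataString", concat_ws( "|", $"x'", $"y'", $"a" ) )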
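
To make the effect of the split concrete, here is a small plain-Scala sketch (no Spark) of what "a matches a' OR b'" amounts to once each reference row has two variants: the distance to a query row is the smaller of the two variant distances. This is a brute-force illustration under that assumption, not the LSH join itself; OrMatchSketch, Reference, Query, orDistance and bestMatch are hypothetical names.

```scala
object OrMatchSketch {
  final case class Reference(x: String, y: String, a: String, b: String)
  final case class Query(x: String, y: String, a: String)

  // Lowercased token set, one token per field
  def tokens(fields: String*): Set[String] = fields.map(_.toLowerCase).toSet

  // Jaccard distance = 1 - |A intersect B| / |A union B|
  def jaccardDistance(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else 1.0 - (a intersect b).size.toDouble / (a union b).size.toDouble

  // The two variants produced by the split: x'+y'+a' and x'+y'+b'
  def variants(r: Reference): Seq[Set[String]] =
    Seq(tokens(r.x, r.y, r.a), tokens(r.x, r.y, r.b))

  // OR condition: keep whichever variant is closer to the query
  def orDistance(r: Reference, q: Query): Double = {
    val queryTokens = tokens(q.x, q.y, q.a)
    variants(r).map(v => jaccardDistance(v, queryTokens)).min
  }

  // Exhaustive best match over all references (quadratic overall; for illustration only)
  def bestMatch(refs: Seq[Reference], q: Query): Option[(Reference, Double)] =
    if (refs.isEmpty) None
    else Some(refs.map(r => (r, orDistance(r, q))).minBy(_._2))

  def main(args: Array[String]): Unit = {
    val refs = Seq(
      Reference("John", "Smith", "55649", "28200"),
      Reference("Jack", "Colello", "94552", "15609")
    )
    println(bestMatch(refs, Query("John", "Smith", "28200")))
    println(bestMatch(refs, Query("Jake", "Colello", "15609")))
  }
}
```

With the sample rows, John Smith's query value 28200 equals b' rather than a', so only the b' variant matches exactly. In the Spark version, the analogous step would be to keep, per dataset2 row, the joined pair with the smallest distance.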
