String similarity with OR condition in MinHash Spark ML


Question

I have two datasets. The first is a large reference dataset; for each row of the second dataset I want to find the best match in the first dataset using the MinHash algorithm.

val dataset1 = 
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|   Colello| 94552| 15609|     Jack|Colello|94552|
|     Anatolie|     Trifa| 63011| 51181|   Anatolie|Trifa|63011|
|      Jaromir|      Plch| 51237| 91798|     Jaromir|Plch|51237|
+-------------+----------+------+------+-----------------------+

// very_large
val dataset2 =
+-------------+----------+------+-----------------------+
|            x|         y|     a|      dataString(x+y+a)|
+-------------+----------+------+-----------------------+
|         John|     Smith| 28200|       John|Smith|28200|
|         Emma|   Morales| 17706|     Emma|Morales|17706|
|        Janet|  Alvarado| 98809|   Janet|Alvarado|98809|
|    Elizabeth|   Keatley| 36935|Elizabeth|Keatley|36935|
|     Cristina|      Cruz| 75716|    Cristina|Cruz|75716|
|         Jake|   Colello| 15609|     Jake|Colello|15609|
|     Anatolie|     Trifa| 63011|   Anatolie|Trifa|63011|
|         Rune|      Eide| 41907|        Rune|Eide|41907|
|    Hortensia|   Brumaru| 33836|Hortensia|Brumaru|33836|
|       Adrien|     Payet| 40463|     Adrien|Payet|40463|
|       Ashley|    Howard| 12445|    Ashley|Howard|12445|
|       Pamela|      Dean| 81311|      Pamela|Dean|81311|
|        Laura|     Calvo| 82682|      Laura|Calvo|82682|
|        Flora|   Parghel| 81206|    Flora|Parghel|81206|
|      Jaromír|      Plch| 91798|     Jaromír|Plch|91798|
+-------------+----------+------+-----------------------+

For string similarity, I created a pipe-separated (|) dataString column.

Here is the code that finds the similarity between dataString (x' + y' + a') and dataString (x + y + a); it works fine:

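import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, MinHashLSH, MinHashLSHModel, RegexTokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.DataTypes

val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords")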
val vectorizer = new CountVectorizer().setInputCol("dataStringWords").setOutputCol("features")

val pipelineTV = new Pipeline().setStages(Array(tokenizer, vectorizer))
val modelTV = pipelineTV.fit(dataset1)

val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)

val dataset1_TV = modelTV.transform(dataset1).filter(isNoneZeroVector(col("features")))
val dataset2_TV = modelTV.transform(dataset2).filter(isNoneZeroVector(col("features")))

val lsh = new MinHashLSH().setNumHashTables(20).setInputCol("features").setOutputCol("hashValues")
val pipelineLSH = new Pipeline().setStages(Array(lsh))
val modelLSH = pipelineLSH.fit(dataset1_TV)

val dataset1_LSH = modelLSH.transform(dataset1_TV)
val dataset2_LSH = modelLSH.transform(dataset2_TV)

val finalResult = modelLSH.stages.last.asInstanceOf[MinHashLSHModel].approxSimilarityJoin(dataset1_LSH, dataset2_LSH, 0.5)
finalResult.show
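
For intuition about the 0.5 threshold: MinHashLSH approximates the Jaccard distance between the token sets of the two strings. The following is a minimal plain-Scala sketch (no Spark) of the exact distance. The names `JaccardSketch`, `tokens` and `jaccardDistance` are made up for illustration, and the lowercasing is an assumption meant to mirror RegexTokenizer's default behaviour.

```scala
// Hypothetical helpers for illustration only; not part of the Spark API.
object JaccardSketch {
  // Split a pipe-separated dataString into a set of lowercase tokens.
  def tokens(dataString: String): Set[String] =
    dataString.toLowerCase.split("\\|").filter(_.nonEmpty).toSet

  // Jaccard distance = 1 - |A intersect B| / |A union B|.
  def jaccardDistance(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else 1.0 - (a intersect b).size.toDouble / (a union b).size.toDouble

  def main(args: Array[String]): Unit = {
    // Two of three tokens are shared, four distinct tokens overall.
    println(jaccardDistance(tokens("John|Smith|55649"), tokens("John|Smith|28200")))
    // Identical strings share every token.
    println(jaccardDistance(tokens("Anatolie|Trifa|63011"), tokens("Anatolie|Trifa|63011")))
  }
}
```

The distances produced by the Spark pipeline can differ from this sketch, because the CountVectorizer is fitted on dataset1 only, so tokens that appear only in dataset2 are dropped from its feature vectors.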

As mentioned, the code above gives a perfect result, but my requirement is to compare a with a' OR b', i.e.:

x' + y' + (a' OR b')
x  + y  + (   a    )

I cannot join these two datasets because they have no common field; the only alternative would be a cross join.

So is there any way to achieve string similarity with an OR condition in grouped data in Apache Spark 2.2.0?

Recommended answer

I don't think it is possible to set two input columns (one dataString column for each of a' and b') and then apply an OR during the computation. What you can do is transform dataset1 so that it contains both the x' + y' + a' and the x' + y' + b' variants, and then run the distance computation. This won't give you exactly the same answer as selecting a' or b' based on the corresponding row in dataset2 (I think you know how to do that expensive operation), but it still gives some sense of similarity.

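import org.apache.spark.sql.functions.{array, concat_ws, explode}
import spark.implicits._  // for the $"..." syntax; assumes a SparkSession named spark

// One output row per variant: first with a', then with b', in the new column "a".
val dataset1splitted =
    dataset1
    .withColumn( "a", explode( array( "a'", "b'" ) ) )
    .drop( "a'", "b'", "dataString" )
    // Rebuild the pipe-separated dataString from x', y' and the exploded value.
    .withColumn( "dataString", concat_ws( "|", $"x'", $"y'", $"a" ) )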
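
To make the OR semantics concrete, here is a minimal plain-Scala sketch (no Spark) of the same idea: build one dataString per variant and treat a reference row as being as close as its closest variant. The names `OrMatchSketch`, `Ref`, `Query`, `variants` and `orDistance` are hypothetical, and the sketch uses the exact Jaccard distance on lowercase tokens rather than the MinHash approximation.

```scala
// Hypothetical illustration of "a matches a' OR b'"; not part of the Spark API.
object OrMatchSketch {
  // Record types mirroring the columns of dataset1 (x', y', a', b') and dataset2 (x, y, a).
  final case class Ref(x: String, y: String, a: String, b: String)
  final case class Query(x: String, y: String, a: String)

  // Split a pipe-separated string into a set of lowercase tokens.
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\|").filter(_.nonEmpty).toSet

  // Jaccard distance = 1 - |P intersect Q| / |P union Q|.
  def jaccardDistance(p: Set[String], q: Set[String]): Double =
    if (p.isEmpty && q.isEmpty) 0.0
    else 1.0 - (p intersect q).size.toDouble / (p union q).size.toDouble

  // Same idea as explode(array("a'", "b'")): one dataString per variant.
  def variants(r: Ref): Seq[String] =
    Seq(r.a, r.b).map(v => Seq(r.x, r.y, v).mkString("|"))

  // OR semantics: the distance to a reference row is the minimum over its variants.
  def orDistance(r: Ref, q: Query): Double = {
    val queryTokens = tokens(Seq(q.x, q.y, q.a).mkString("|"))
    variants(r).map(v => jaccardDistance(tokens(v), queryTokens)).min
  }

  def main(args: Array[String]): Unit = {
    val ref = Ref("John", "Smith", "55649", "28200")
    val query = Query("John", "Smith", "28200")
    variants(ref).foreach(println)
    println(orDistance(ref, query))
  }
}
```

In Spark, the corresponding step would be to run the similarity join against dataset1splitted and then keep, for each pair of original rows, the smallest distance.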
