Calculating cosine similarity by featurizing the text into vectors using tf-idf


Problem description

I'm new to Apache Spark and want to find similar text within a bunch of texts. I have tried the following myself -

I have 2 RDDs -

The 1st RDD contains incomplete text as follows -

[0,541 Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans,LA, 70116]
[2,#69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls]
[4,56 E Morehead Webb, TX, 78045]

The 2nd RDD contains the correct addresses as follows -

[0,541 Jefferson Avenue, Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans, Orleans, LA, 70116]
[2,25 E 75th St #69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls, Geauga, OH, 44023]
[4,56 E Morehead St, Laredo, Webb, TX, 78045]

I have written this code, but it is taking a lot of time. Can anyone please tell me the correct way of doing this in Apache Spark using Scala?

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val incorrect_address_count = incorrect_address_rdd.count()

// Tokenize both RDDs together so the hashed term space is shared.
val all_address = incorrect_address_rdd.union(correct_address_rdd)
  .map(_._2.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf = hashingTF.transform(all_address).zipWithIndex()

// The first incorrect_address_count rows came from the incomplete addresses.
val input_vector_rdd = tf.filter(_._2 < incorrect_address_count)

// The remaining rows are the correct addresses; re-key them back to their
// original ids so they can be joined with the address text.
val address_db_vector_rdd = tf.filter(_._2 >= incorrect_address_count)
  .map(f => (f._2 - incorrect_address_count, f._1))
  .join(correct_address_rdd)
  .map(f => (f._2._1, f._2._2))

// Score every (incomplete, correct) pair -- this cartesian product is the
// expensive part.
val input_similarity_rdd = input_vector_rdd.cartesian(address_db_vector_rdd)
  .map(f => {
    val cosine_similarity = cosineSimilarity(f._1._1.toDense, f._2._1.toDense)
    (f._1._2, cosine_similarity, f._2._2)
  })

def cosineSimilarity(vectorA: Vector, vectorB: Vector): Double = {
  var dotProduct = 0.0
  var normA = 0.0
  var normB = 0.0

  for (i <- 0 until vectorA.size) {
    dotProduct += vectorA(i) * vectorB(i)
    normA += Math.pow(vectorA(i), 2)
    normB += Math.pow(vectorB(i), 2)
  }
  dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}
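The dense cosine computation above can be sanity-checked in plain Scala without a Spark cluster. This is a minimal sketch using `Array[Double]` in place of mllib `Vector`s; the names and test values are illustrative, not part of the original code:

```scala
// Plain-Scala equivalent of the dense cosineSimilarity above,
// with Array[Double] standing in for org.apache.spark.mllib.linalg.Vector.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var dot = 0.0; var normA = 0.0; var normB = 0.0
  for (i <- a.indices) {
    dot += a(i) * b(i)
    normA += a(i) * a(i)
    normB += b(i) * b(i)
  }
  dot / (math.sqrt(normA) * math.sqrt(normB))
}

// Identical vectors score 1.0; orthogonal vectors score 0.0.
val same = cosineSimilarity(Array(1.0, 2.0, 3.0), Array(1.0, 2.0, 3.0))
val orth = cosineSimilarity(Array(1.0, 0.0), Array(0.0, 1.0))
```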

Recommended answer

I had nearly the same problem. I had 370K rows, with 2 vectors of 300K and 400K dimensions for each row, and I was multiplying the test RDD rows with both of these vectors.

There are 2 big improvements you can make. One is to pre-calculate the norms; they do not change. The second is to use sparse vectors. If you iterate with `vector.size`, that is 300K iterations; if you use the sparse representation, it only iterates over the non-zero keywords (20-30 per row).
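Those two improvements can be sketched in plain Scala (sparse vectors as sorted index/value arrays, mimicking mllib's `SparseVector` layout). The names here are illustrative, not the poster's actual code:

```scala
// A sparse vector as sorted non-zero indices plus their matching values.
case class Sparse(indices: Array[Int], values: Array[Double]) {
  // Improvement 1: the norm never changes, so compute it once and reuse it.
  val norm: Double = math.sqrt(values.map(v => v * v).sum)
}

// Improvement 2: the dot product only touches the non-zero entries
// (20-30 per row) instead of all 300K dimensions. Both index arrays
// are assumed sorted, so a single merge pass suffices.
def sparseDot(a: Sparse, b: Sparse): Double = {
  var i = 0; var j = 0; var dot = 0.0
  while (i < a.indices.length && j < b.indices.length) {
    if (a.indices(i) == b.indices(j)) {
      dot += a.values(i) * b.values(j); i += 1; j += 1
    } else if (a.indices(i) < b.indices(j)) i += 1
    else j += 1
  }
  dot
}

def cosine(a: Sparse, b: Sparse): Double = {
  val div = a.norm * b.norm
  if (div == 0) 0.0 else sparseDot(a, b) / div
}
```

With the norm stored alongside each vector before the cartesian product, every pairwise score only costs the merge over non-zero indices.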

Also, I'm afraid this is already close to the most efficient way, since the per-pair calculations themselves do not need a shuffle. If you have a good estimate of a threshold, you can filter by score at the end and things will be fast (I mean, whatever score is good enough for you).

import org.apache.spark.mllib.linalg.SparseVector

// normASqrt and normBSqrt are the pre-computed norms (sqrt of the sum of
// squares) of vectorA and vectorB; they never change, so compute them once
// outside this function.
def cosineSimilarity(vectorA: SparseVector, vectorB: SparseVector,
                     normASqrt: Double, normBSqrt: Double): (Double, Double) = {
  var dotProduct = 0.0
  // Iterate only over vectorA's non-zero indices (20-30 per row),
  // not the full 300K-dimensional space.
  for (i <- vectorA.indices) {
    dotProduct += vectorA(i) * vectorB(i)
  }
  val div = normASqrt * normBSqrt
  if (div == 0)
    (dotProduct, 0)
  else
    (dotProduct, dotProduct / div)
}

