Spark Cosine Similarity (DIMSUM algorithm) sparse input file

Views: 680 Published: 2016/5/22 16:05:34 Tags: apache-spark sparse-matrix cosine-similarity

Question

I was wondering whether it would be possible for Spark Cosine Similarity to work with Sparse input data? I have seen examples wherein the input consists of lines of space-separated features of the form:

id feat1 feat2 feat3 ...

but I have an inherently sparse, implicit feedback setting and would like to have input in the form:

id1 feat1:1 feat5:1 feat10:1
id2 feat3:1 feat5:1 ..
...

I would like to make use of the sparsity to speed up the calculation. Ultimately I also wish to use the DIMSUM algorithm for calculating all-pairs similarity, which has recently been incorporated into Spark. Could someone suggest a sparse input format that would work with DIMSUM on Spark? I checked the example code, and the comments say "The input must be a dense matrix", but this code was in the examples, so I don't know whether it was referring only to that particular case.

spark/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

That's the path to the example code that I'm referring to.

Just a couple of lines representing how the sparse-input format should look (from a recommendation system perspective, user_id feat1:1 feat2:1 ...), to work with cosine similarity, would be extremely helpful.

Also would it be okay if I left the user_ids as strings?

I am aware that libsvm format is similar but there is no notion of a user id in this case, only input instances with features so I was wondering how the libsvm format would translate into a recommendation system domain?

My apologies for the extremely simplistic questions, I am extremely new to Spark and am just getting my feet wet.

Any help would be much appreciated. Thanks in advance!

Recommended answer

Why not? A naive solution can look more or less like this:

// Parse input line
def parseLine(line: String) = {
    def parseFeature(feature: String) = {
        feature.split(":") match {
            case Array(k, v) => (k, v.toDouble)
        }
    }

    val bits = line.split(" ")
    val id = bits.head
    val features = bits.tail.map(parseFeature).toMap
    (id, features)
}

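// Compute dot product between two dicts
def dotProduct(x: Map[String, Double], y: Map[String, Double]): Double = {
    // Iterate over the smaller map; features missing from the other side contribute 0
    val (small, large) = if (x.size <= y.size) (x, y) else (y, x)
    small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
}

// Compute Euclidean norm of dict
def norm(x: Map[String, Double]): Double = math.sqrt(x.values.map(v => v * v).sum)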

// Compute cosine similarity
def sparseCosine(x: Map[String, Double], y: Map[String, Double]): Double = {
    dotProduct(x, y) / (norm(x) * norm(y))
}

// Parse input lines
val parsed  = sc.textFile("features.txt").map(parseLine)

// Find unique pairs: keep each unordered pair once and drop self-pairs
val pairs = parsed.cartesian(parsed).filter(x => x._1._1 < x._2._1)

// Compute cosine similarity between pairs
pairs.map { case ((k1, m1), (k2, m2)) => ((k1, k2), sparseCosine(m1, m2)) }
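
The approach above keeps user ids and feature names as strings throughout, so string ids are fine there. If the data is later handed to MLlib instead, its sparse vectors (`Vectors.sparse(size, indices, values)`) expect integer indices, so feature names would first need to be mapped to column numbers. A minimal sketch of that mapping in plain Scala follows; `SparseEncoding`, `buildIndex`, and `encode` are hypothetical names used only for illustration, and the sample rows are the ones from the question.

```scala
// Hypothetical helper, plain Scala with no Spark dependency.
object SparseEncoding {

  // Assign every distinct feature name an integer column index.
  // Names are sorted first so the assignment is deterministic.
  def buildIndex(rows: Seq[(String, Map[String, Double])]): Map[String, Int] =
    rows.flatMap(_._2.keys).distinct.sorted.zipWithIndex.toMap

  // Turn one row into parallel (indices, values) arrays, ordered by index.
  def encode(features: Map[String, Double],
             index: Map[String, Int]): (Array[Int], Array[Double]) = {
    val sorted = features.toSeq
      .map { case (name, value) => (index(name), value) }
      .sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the question; ids stay as strings.
    val rows = Seq(
      "id1" -> Map("feat1" -> 1.0, "feat5" -> 1.0, "feat10" -> 1.0),
      "id2" -> Map("feat3" -> 1.0, "feat5" -> 1.0)
    )
    val index = buildIndex(rows)
    rows.foreach { case (id, features) =>
      val (indices, values) = encode(features, index)
      println(s"$id size=${index.size} indices=${indices.mkString(",")} values=${values.mkString(",")}")
    }
  }
}
```

Each `(indices, values)` pair, together with `index.size`, has the shape that `Vectors.sparse` takes. One thing to keep in mind when going that route: `RowMatrix.columnSimilarities`, which is where DIMSUM lives in MLlib, compares columns rather than rows, so for user-to-user similarity the users would have to be laid out as columns.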
