Spark Cosine Similarity (DIMSUM algorithm) sparse input file

Question

I was wondering whether it would be possible for Spark Cosine Similarity to work with Sparse input data? I have seen examples wherein the input consists of lines of space-separated features of the form:

id feat1 feat2 feat3 ...

but I have an inherently sparse, implicit feedback setting and would like to have input in the form:

id1 feat1:1 feat5:1 feat10:1
id2 feat3:1 feat5:1 ..
...

I would like to make use of the sparsity to improve the calculation. Also ultimately I wish to use the DIMSUM algorithm for calculating all-pairs-similarity that has been recently incorporated into Spark. Could someone suggest a sparse-input format that would work with DIMSUM on spark? I checked the example code and in the comments it says "The input must be a dense matrix" but this code was in examples so I don't know whether it was referring only to one particular case.

spark/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

That's the path to the example code that I'm referring to.

Just a couple of lines representing how the sparse-input format should look (from a recommendation system perspective, user_id feat1:1 feat2:1 ...), to work with cosine similarity, would be extremely helpful.

Also would it be okay if I left the user_ids as strings?

I am aware that libsvm format is similar but there is no notion of a user id in this case, only input instances with features so I was wondering how the libsvm format would translate into a recommendation system domain?

My apologies for the extremely simplistic questions, I am extremely new to Spark and am just getting my feet wet.

Any help would be much appreciated. Thanks in advance!

Recommended answer

Why not? A naive solution could look more or less like this:

// Parse input line
def parseLine(line: String) = {
    def parseFeature(feature: String) = {
        feature.split(":") match {
            case Array(k, v) => (k, v.toDouble)
        }
    }

    val bits = line.split(" ")
    val id = bits.head
    val features = bits.tail.map(parseFeature).toMap
    (id, features)
}

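// Compute dot product between two maps; only keys present in both contribute
def dotProduct(x: Map[String, Double], y: Map[String, Double]): Double = {
    // Iterate over the smaller map and look up each key in the larger one
    val (small, large) = if (x.size <= y.size) (x, y) else (y, x)
    small.iterator.map { case (k, v) => v * large.getOrElse(k, 0.0) }.sum
}

// Compute Euclidean norm of a map
def norm(x: Map[String, Double]): Double = math.sqrt(x.valuesIterator.map(v => v * v).sum)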

// Compute cosine similarity
def sparseCosine(x: Map[String, Double], y: Map[String, Double]): Double = {
    dotProduct(x, y) / (norm(x) * norm(y))
}

// Parse input lines
val parsed  = sc.textFile("features.txt").map(parseLine)

// Keep each unordered pair once (id1 < id2); this also drops self-pairs
val pairs = parsed.cartesian(parsed).filter(x => x._1._1 < x._2._1)

// Compute cosine similarity between pairs
pairs.map { case ((k1, m1), (k2, m2)) => ((k1, k2), sparseCosine(m1, m2)) } 
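
With the map-based approach above, string user ids are fine as they are. If you later want to go through MLlib instead, my understanding is that `RowMatrix.columnSimilarities` (where DIMSUM lives) compares columns rather than rows, and that sparse vectors are built from integer indices, so string ids and feature names would first need to be mapped to integers, with users laid out as columns. Below is a rough sketch of that preprocessing in plain Scala collections, with no Spark involved. The names `buildIndex` and `featureRows` are made up for illustration, and the two records are just the sample lines from the question.

```scala
// Sample records in the "id feat:value ..." form from the question.
val records: Seq[(String, Map[String, Double])] = Seq(
  ("id1", Map("feat1" -> 1.0, "feat5" -> 1.0, "feat10" -> 1.0)),
  ("id2", Map("feat3" -> 1.0, "feat5" -> 1.0))
)

// Hypothetical helper: give every distinct string an integer index.
// Sorting first makes the assignment deterministic.
def buildIndex(keys: Iterable[String]): Map[String, Int] =
  keys.toSeq.distinct.sorted.zipWithIndex.toMap

val userIndex: Map[String, Int] = buildIndex(records.map(_._1))
val featureIndex: Map[String, Int] = buildIndex(records.flatMap(_._2.keys))

// Hypothetical helper: transpose the data into one sparse row per feature,
// where the column indices are user indices. Each row is stored as
// (sorted column indices, matching values).
def featureRows(
    recs: Seq[(String, Map[String, Double])],
    users: Map[String, Int],
    feats: Map[String, Int]): Map[Int, (Array[Int], Array[Double])] = {
  val entries = for {
    (user, fs) <- recs
    (feat, value) <- fs.toSeq
  } yield (feats(feat), users(user), value)

  entries.groupBy(_._1).map { case (row, es) =>
    val sorted = es.sortBy(_._2)
    (row, (sorted.map(_._2).toArray, sorted.map(_._3).toArray))
  }
}

val rows: Map[Int, (Array[Int], Array[Double])] =
  featureRows(records, userIndex, featureIndex)
```

Each `(indices, values)` pair has the shape that a sparse vector constructor such as MLlib's `Vectors.sparse(size, indices, values)` takes, with `size` equal to the number of users. On a real dataset the indexing and grouping would be done on RDDs rather than local collections, and you would keep `userIndex` around to translate the resulting column indices back to the original string ids.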
