Run Mahout RowSimilarity recommender on MongoDB data

Question

I have managed to run Mahout rowsimilarity on flat files of the below format:

item-id tag1 tag2 tag3

This has to be run via the CLI, and the output is again flat files. I want to change this so that it reads data from MongoDB (I'm open to using other DBs too) and then dumps the output to the DB, from where it can be picked up by our system.

I've researched for the past few days and found the following:

  • Will have to write Scala code implementing RowSimilarity

  • Pass an IndexedDataset object to handle the data

  • Convert the output to the required format (JSON/CSV)

What I'm yet to figure out is how to import data from the DB into an IndexedDataset. I've also read about the RDD format, but I still can't figure out how to convert JSON data into an RDD that the RowSimilarity code can use.

TL;DR: How do I convert MongoDB data so that it can be processed by Mahout/Spark rowsimilarity?

Edit 1: I have found some code that converts Mongo data to an RDD, from this link: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage#scala-example

Now I need help converting it to an IndexedDataset so that it can be passed to SimilarityAnalysis.rowSimilarityIDS.

TL;DR: How do I convert an RDD to an IndexedDataset?
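(For anyone here for just that last question: once the data is an RDD[(String, String)] of (row-id, column-id) pairs, the conversion is a one-liner against Mahout's Spark bindings. A minimal sketch with invented data follows; the object and value names and the local master are illustrative, and the full working program is in the answer below.)

import org.apache.mahout.sparkbindings
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

object ConvertSketch extends App {
  // Mahout's wrapper around a SparkContext, implicit so Mahout's APIs can pick it up
  implicit val mc = sparkbindings.mahoutSparkContext(masterUrl = "local", appName = "ConvertSketch")
  // (row-id, column-id) pairs, e.g. (product, tag)
  val pairs = mc.parallelize(Seq(("p1", "red"), ("p1", "cotton"), ("p2", "red")))
  // Builds the sparse matrix plus the row/column ID dictionaries
  val ids = IndexedDatasetSpark(pairs)(mc)
}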

Answer

Here's the answer:

import org.apache.hadoop.conf.Configuration
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.Schema
import org.apache.mahout.sparkbindings
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.spark.rdd.RDD
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

object SparkExample extends App {
  // Mahout's wrapper around a SparkContext, implicit so the Mahout calls below can pick it up
  implicit val mc = sparkbindings.mahoutSparkContext(masterUrl = "local", appName = "RowSimilarity")

  // Point mongo-hadoop at the source collection
  val mongoConfig = new Configuration()
  mongoConfig.set("mongo.input.uri", "mongodb://hostname:27017/db.collection")

  // Read the collection as an RDD of (object id, BSON document) pairs
  val documents: RDD[(Object, BSONObject)] = mc.newAPIHadoopRDD(
    mongoConfig,
    classOf[MongoInputFormat],
    classOf[Object],
    classOf[BSONObject]
  )

  // Extract (product_id, attribute values). The attribute field arrives as a
  // JSON-array-like string, so strip the brackets and quotes, split on the
  // delimiter, and normalise each value to lowercase-with-hyphens.
  val documents_Array: RDD[(String, Array[String])] = documents.map(
    doc1 => (
      doc1._2.get("product_id").toString(),
      doc1._2.get("product_attribute_value").toString()
        .replace("[ \"", "")
        .replace("\"]", "")
        .split("\" , \"")
        .map(value => value.toLowerCase.replace(" ", "-"))
    )
  )

  // One (product_id, attribute) pair per array element, the (row, column)
  // shape that IndexedDatasetSpark expects
  val new_doc: RDD[(String, String)] = documents_Array.flatMapValues(x => x)
  val myIDs = IndexedDatasetSpark(new_doc)(mc)

  // Text output format: rowKey<TAB>columnId:strength columnId:strength ...
  val readWriteSchema = new Schema(
    "rowKeyDelim" -> "\t",
    "columnIdStrengthDelim" -> ":",
    "omitScore" -> false,
    "elementDelim" -> " "
  )
  SimilarityAnalysis.rowSimilarityIDS(myIDs).dfsWrite("hdfs://hadoop:9000/mongo-hadoop-rowsimilarity", readWriteSchema)(mc)
}
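With the Schema above, dfsWrite writes plain-text part files, one row per item: the row key, a tab, then space-separated columnId:strength pairs. A small sketch of reading that back, reusing the implicit context mc and the output path from the example; the value names are illustrative and the parsing simply mirrors the delimiters configured in readWriteSchema:

// Each output line looks like: itemA<TAB>itemB:4.89 itemC:3.21
val lines = mc.textFile("hdfs://hadoop:9000/mongo-hadoop-rowsimilarity")
val similarities = lines.map { line =>
  val Array(rowKey, row) = line.split("\t")
  val neighbors = row.split(" ").map { elem =>
    val i = elem.lastIndexOf(":") // split on the last ':' in case an ID contains one
    (elem.substring(0, i), elem.substring(i + 1).toDouble)
  }
  (rowKey, neighbors)
}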

build.sbt:

name := "scala-mongo"
version := "1.0"
scalaVersion := "2.10.6"
// Casbah (MongoDB Scala driver) and mongo-hadoop, which provides MongoInputFormat
libraryDependencies += "org.mongodb" %% "casbah" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.4.2"

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "2.6.0" exclude("javax.servlet", "servlet-api") exclude ("com.sun.jmx", "jmxri") exclude ("com.sun.jdmk", "jmxtools") exclude ("javax.jms", "jms") exclude ("org.slf4j", "slf4j-log4j12") exclude("hsqldb","hsqldb"),
  "org.scalatest" % "scalatest_2.10" % "1.9.2" % "test"
)
// Mahout 0.11.2 Spark bindings are built against Scala 2.10, matching scalaVersion above
libraryDependencies += "org.apache.mahout" % "mahout-math-scala_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-spark_2.10" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-math" % "0.11.2"
libraryDependencies += "org.apache.mahout" % "mahout-hdfs" % "0.11.2"

resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"
resolvers += Resolver.mavenLocal

I've used mongo-hadoop to get the data from Mongo and work with it. Since my data had an array, I had to use flatMapValues to flatten it and then pass it to IDS for proper output.
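To make the flatMapValues step concrete, here's a toy illustration (data invented for the example): each (product, tags-array) pair fans out into one (product, tag) pair per tag, which is exactly the (row-id, column-id) shape IndexedDatasetSpark expects:

// (product, Array(tags)) -> one (product, tag) pair per tag
val docs = mc.parallelize(Seq(
  ("p1", Array("red", "cotton")),
  ("p2", Array("red"))
))
val flat = docs.flatMapValues(tags => tags)
// flat.collect() gives ("p1","red"), ("p1","cotton"), ("p2","red")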

PS: I posted the answer here and not on the linked question (http://stackoverflow.com/questions/37073140/scala-create-indexeddatasetspark-object?lq=1) because this Q&A covers the full scope of getting the data and processing it.
