Spark MLlib / K-Means intuition


Question

I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html

Specifically, this code:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

Except that I'm trying to run it in batch mode on some tweets pulled out of Cassandra, in this case 200 tweets total.

As the example shows, I am using this object to "vectorize" a set of tweets:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
}
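To make the hashing step concrete, here is a simplified, Spark-free sketch of the idea behind HashingTF. This is only an illustration of the hashing trick, not MLlib's actual implementation; `SimpleHashingTF` is a made-up name, and real HashingTF returns a sparse MLlib Vector rather than a Map:

```scala
object SimpleHashingTF {
  val numFeatures = 1000

  // Map each character bigram to a bucket in [0, numFeatures)
  // and count how many bigrams land in each bucket.
  def featurize(s: String): Map[Int, Int] = {
    s.sliding(2).toSeq
      .map(b => ((b.hashCode % numFeatures) + numFeatures) % numFeatures)
      .groupBy(identity)
      .map { case (idx, hits) => idx -> hits.size }
  }
}
```

Collisions are possible (two different bigrams can share a bucket), which is the price paid for capping the feature count at 1000.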

Here is my code, modified from ExamineAndTrain.scala:

import org.apache.spark.mllib.clustering.KMeans

val noSets = rawTweets.map(set => set.mkString("\n"))

val vectors = noSets.map(Utils.featurize).cache()
vectors.count()

val numClusters = 5
val numIterations = 30

val model = KMeans.train(vectors, numClusters, numIterations)

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  noSets.foreach { t =>
    if (model.predict(Utils.featurize(t)) == 1) {
      println(t)
    }
  }
}

This code runs, and each cluster prints "CLUSTER 0", "CLUSTER 1", etc., with nothing printed beneath. If I flip

model.predict(Utils.featurize(t)) == 1

to

model.predict(Utils.featurize(t)) == 0

the same thing happens, except every tweet is printed beneath every cluster.

Here is what I intuitively think is happening (please correct my thinking if it's wrong): this code turns each tweet into a vector, randomly picks some clusters, then runs k-means to group the tweets (at a very high level, the clusters, I assume, would be common "topics"). As such, when it checks each tweet to see whether model.predict == 1, a different set of tweets should appear under each cluster (and because it's checking the training set against itself, every tweet should land in some cluster). Why isn't it doing this? Either my understanding of what k-means does is wrong, my training set is too small, or I'm missing a step.

Any help is greatly appreciated.

Answer

Well, first of all, KMeans is a clustering algorithm and as such unsupervised. So there is no "checking of the training set against itself" (well, okay, you can do it manually ;).

Your understanding is actually quite good; you're just missing the point that model.predict(Utils.featurize(t)) gives you the cluster that t belongs to, as assigned by KMeans. I think you want to check

model.predict(Utils.featurize(t)) == i

in your code, since i iterates through all the cluster labels.

Also, a small remark: the feature vector is created from a 2-gram model over the characters of the tweets. This intermediate step is important ;)

2-gram (for words) means: "A bear shouts at a bear" => {(A, bear), (bear, shouts), (shouts, at), (at, a), (a, bear)}, i.e. "a bear" is counted twice. For characters it would be (A, [space]), ([space], b), (b, e), and so on.
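The sliding-window behavior can be checked in a plain Scala REPL, no Spark needed:

```scala
// Word-level bigrams: split into words, then slide a window of size 2.
val wordBigrams = "A bear shouts at a bear"
  .split(" ")
  .sliding(2)
  .map(_.mkString(" "))
  .toSeq
// → Seq("A bear", "bear shouts", "shouts at", "at a", "a bear")

// Character-level bigrams, as featurize() above uses:
val charBigrams = "A bear".sliding(2).toSeq
// → Seq("A ", " b", "be", "ea", "ar")
```

Note that the character bigrams include spaces, so word boundaries still leave a trace in the features.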
