Spark MLlib / K-Means intuition


Question

I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html

Specifically, this code:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

Except that I'm trying to run it in batch mode on some tweets pulled out of Cassandra, in this case 200 tweets total.

As the example shows, I am using this object to "vectorize" a set of tweets:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
}
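To make the hashing step concrete, here is a simplified, Spark-free sketch of the idea behind HashingTF. This is only an illustration of the hashing trick, not MLlib's actual implementation; `SimpleHashingTF` is a made-up name, and real HashingTF returns a sparse MLlib Vector rather than a Map:

```scala
object SimpleHashingTF {
  val numFeatures = 1000

  // Map each character bigram to a bucket in [0, numFeatures)
  // and count how many bigrams land in each bucket.
  def featurize(s: String): Map[Int, Int] = {
    s.sliding(2).toSeq
      .map(b => ((b.hashCode % numFeatures) + numFeatures) % numFeatures)
      .groupBy(identity)
      .map { case (idx, hits) => idx -> hits.size }
  }
}
```

Collisions are possible (two different bigrams can share a bucket), which is the price paid for capping the feature count at 1000.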

Here is my code, modified from ExamineAndTrain.scala:

import org.apache.spark.mllib.clustering.KMeans

val noSets = rawTweets.map(set => set.mkString("\n"))

val vectors = noSets.map(Utils.featurize).cache()
vectors.count()

val numClusters = 5
val numIterations = 30

val model = KMeans.train(vectors, numClusters, numIterations)

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  noSets.foreach { t =>
    if (model.predict(Utils.featurize(t)) == 1) {
      println(t)
    }
  }
}

This code runs, and each cluster prints "CLUSTER 0", "CLUSTER 1", etc., with nothing printed beneath. If I flip

model.predict(Utils.featurize(t)) == 1

to

model.predict(Utils.featurize(t)) == 0

the same thing happens, except every tweet is printed beneath every cluster.

Here is what I intuitively think is happening (please correct my thinking if it's wrong): this code turns each tweet into a vector, randomly picks some clusters, then runs k-means to group the tweets (at a very high level, the clusters, I assume, would be common "topics"). As such, when it checks each tweet to see whether model.predict == 1, a different set of tweets should appear under each cluster (and because it's checking the training set against itself, every tweet should land in some cluster). Why isn't it doing this? Either my understanding of what k-means does is wrong, my training set is too small, or I'm missing a step.

Any help is greatly appreciated.

Answer

Well, first of all, KMeans is a clustering algorithm and as such unsupervised. So there is no "checking of the training set against itself" (well, okay, you can do it manually ;).

Your understanding is actually quite good; you're just missing the point that model.predict(Utils.featurize(t)) gives you the cluster that t belongs to, as assigned by KMeans. I think you want to check

model.predict(Utils.featurize(t)) == i

in your code, since i iterates through all the cluster labels.

Also, a small remark: the feature vector is created from a 2-gram model over the characters of the tweets. This intermediate step is important ;)

2-gram (for words) means: "A bear shouts at a bear" => {(A, bear), (bear, shouts), (shouts, at), (at, a), (a, bear)}, i.e. "a bear" is counted twice. For characters it would be (A, [space]), ([space], b), (b, e), and so on.
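The sliding-window behavior can be checked in a plain Scala REPL, no Spark needed:

```scala
// Word-level bigrams: split into words, then slide a window of size 2.
val wordBigrams = "A bear shouts at a bear"
  .split(" ")
  .sliding(2)
  .map(_.mkString(" "))
  .toSeq
// → Seq("A bear", "bear shouts", "shouts at", "at a", "a bear")

// Character-level bigrams, as featurize() above uses:
val charBigrams = "A bear".sliding(2).toSeq
// → Seq("A ", " b", "be", "ea", "ar")
```

Note that the character bigrams include spaces, so word boundaries still leave a trace in the features.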
