KMeans|| for sentiment analysis on Spark

Question
I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in 100-dimensional space, and now I'm trying to cluster that vector space. When I run KMeans with the default parallel initialization (KMeans||), the algorithm runs for 3 hours! But with the random initialization strategy it takes about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.
K ~= 4000, maxIterations was 20.
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.storage.StorageLevel

// Wrap each word2vec entry (word -> float array) into a labeled vector.
var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
  model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}", data.count())
log.info("==================Train process started==================")

val clusterSize = modelSize / 5
val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)
val time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
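For reference, a minimal sketch of how the trained model could be checked once kmeans.run returns; clusterModel, data, time, and log are the names from the snippet above, and the WSSSE check is illustrative, not part of the original program:

// Elapsed training time plus a basic quality metric for the fitted model.
val elapsedMs = System.currentTimeMillis() - time
log.info("Training took {} ms", elapsedMs)
// Within-set sum of squared errors over the training data.
val wssse = clusterModel.computeCost(data)
log.info("WSSSE = {}", wssse)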
And the Spark context initialization is here:
val conf = new SparkConf()
.setAppName("SparkPreProcessor")
.setMaster("local[4]")
.set("spark.default.parallelism", "8")
.set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)
Also, a few updates about running this program: I'm running it inside IntelliJ IDEA, and I don't have a real Spark cluster. But I thought that a personal machine can serve as a Spark cluster.
I saw that the program hangs inside this loop in the Spark code, LocalKMeans.scala:
// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
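A rough back-of-the-envelope shows why this loop dominates: iteration i scans every candidate point against i current centers, and each point/center comparison costs O(d) arithmetic, so the whole initialization does on the order of n * d * k^2 / 2 distance-component operations. Below is a minimal sketch of that estimate using the sizes from the question (for KMeans||, n here is really the sampled candidate set handed to LocalKMeans, which for k = 4000 can approach the full 21k points):

object KMeansPlusPlusCostEstimate {
  def main(args: Array[String]): Unit = {
    val n = 21000L // candidate points (upper bound: the full data set)
    val d = 100L   // vector dimension
    val k = 4000L  // requested clusters
    // Iteration i costs ~ n * i * d; summing i = 1..k gives n * d * k * (k + 1) / 2.
    val ops = n * d * k * (k + 1) / 2
    println(s"~$ops distance-component operations") // ~1.7e13, i.e. hours single-threaded
  }
}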
I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same. The problem is that KMeans|| initializes the algorithm across N parallel runs, yet it's still extremely slow for a small amount of data, so my solution is to continue using random initialization. Data size, approximately: 4k clusters for 21k 100-dim vectors.
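For anyone hitting the same wall, a minimal sketch of the random-initialization variant described above (the configuration that finished in about 8 minutes); data and clusterSize are the names from the training snippet earlier in the question:

// Same training call, but with random initialization instead of KMeans||.
val kmeansRandom = new KMeans()
  .setInitializationMode(KMeans.RANDOM)
  .setK(clusterSize)
  .setRuns(1)
  .setMaxIterations(50)
  .setEpsilon(1e-4)
val randomInitModel: KMeansModel = kmeansRandom.run(data)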