KMeans|| for sentiment analysis on Spark


Problem description


I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in a 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization (KMeans||), the algorithm ran for 3 hours! But with the random initialization strategy it took about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.

K ~= 4000, maxIterations was 20

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.storage.StorageLevel

    // VectorWithLabel is my own wrapper pairing each word with its word2vec vector;
    // modelSize is the vocabulary size (~20k), so clusterSize comes out around 4000.
    var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
      model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
    val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
    log.info("Clustering data size {}", data.count())
    log.info("==================Train process started==================")

    val clusterSize = modelSize / 5
    val kmeans = new KMeans()
    kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
    kmeans.setK(clusterSize)
    kmeans.setRuns(1)
    kmeans.setMaxIterations(50)
    kmeans.setEpsilon(1e-4)

    time = System.currentTimeMillis()
    val clusterModel: KMeansModel = kmeans.run(data)

And the Spark context initialization is here:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SparkPreProcessor")
      .setMaster("local[4]")
      .set("spark.default.parallelism", "8")
      .set("spark.executor.memory", "1g")
    val sc = SparkContext.getOrCreate(conf)

Also, a few updates about running this program: I'm running it inside IntelliJ IDEA. I don't have a real Spark cluster, but I thought a personal machine could act as a Spark cluster.

I saw that the program hangs inside this loop in the Spark code, LocalKMeans.scala:

// Initialize centers by sampling using the k-means++ procedure.
    centers(0) = pickWeighted(rand, points, weights).toDense
    for (i <- 1 until k) {
      // Pick the next center with a probability proportional to cost under current centers
      val curCenters = centers.view.take(i)
      val sum = points.view.zip(weights).map { case (p, w) =>
        w * KMeans.pointCost(curCenters, p)
      }.sum
      val r = rand.nextDouble() * sum
      var cumulativeScore = 0.0
      var j = 0
      while (j < points.length && cumulativeScore < r) {
        cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
        j += 1
      }
      if (j == 0) {
        logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
          s" Using duplicate point for center k = $i.")
        centers(i) = points(0).toDense
      } else {
        centers(i) = points(j - 1).toDense
      }
    }
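
For what it's worth, my reading of why this is slow (my own assumption, not something from the Spark docs): KMeans|| collects the sampled candidate points to the driver and then runs this single-threaded k-means++ seeding over them, and every pass of the outer loop recomputes pointCost for all candidates against the centers picked so far, so the work grows roughly quadratically in k. A rough back-of-the-envelope sketch, assuming the candidate set here is on the order of my full ~21k points:

    // Very rough cost estimate for the seeding loop above (assumed numbers, not measured).
    // Each outer iteration i evaluates pointCost against ~i centers for every candidate,
    // so total distance evaluations are on the order of n * k^2 / 2.
    val n = 21000L // assumed number of candidate points collected to the driver
    val k = 4000L  // requested number of clusters
    val d = 100L   // vector dimension
    val distanceEvals = n * k * k / 2     // ~1.7e11 point-to-center distance evaluations
    val multiplyAdds = distanceEvals * d  // ~1.7e13 multiply-adds, all on one driver thread
    println(s"~$distanceEvals distances, ~$multiplyAdds multiply-adds")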

Solution

I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same. The problem is that the parallel KMeans initialization runs the seeding in N parallel runs, but it's still extremely slow for a small amount of data, so my solution is to continue using random initialization. Data size, approximately: 4k clusters for 21k 100-dimensional vectors.
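
A minimal sketch of that change, reusing the clusterSize and data setup from the question above (everything else stays the same; KMeans.RANDOM is MLlib's constant for random initialization):

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

    // Same configuration as in the question, but with random initialization instead of KMeans||.
    val kmeans = new KMeans()
    kmeans.setInitializationMode(KMeans.RANDOM) // was KMeans.K_MEANS_PARALLEL
    kmeans.setK(clusterSize)
    kmeans.setRuns(1)
    kmeans.setMaxIterations(50)
    kmeans.setEpsilon(1e-4)

    val clusterModel: KMeansModel = kmeans.run(data)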

