KMeans|| for sentiment analysis on Spark


Problem description

I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I have a collection of 20k words/vectors in 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization, the algorithm runs for 3 hours! But with the random initialization strategy it takes about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.

K ~= 4000, maxIterations was 20

var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
  model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}", data.count())
log.info("==================Train process started==================")
val clusterSize = modelSize / 5

val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)

time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)

And the Spark context initialization is here:

val conf = new SparkConf()
  .setAppName("SparkPreProcessor")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")
  .set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)

Also, a few updates about how I'm running this program: I'm running it inside IntelliJ IDEA, and I don't have a real Spark cluster. But I thought that a personal machine could serve as a Spark cluster.

I saw that the program hangs inside this loop in the Spark code LocalKMeans.scala:

// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
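
For scale, note that this seeding loop runs on the driver in a single thread, and picking each new center rescans every candidate point against all centers chosen so far. A rough back-of-the-envelope sketch (the values of n, k and d below are assumptions based on the numbers in the question; the real candidate count depends on how many distinct points k-means|| sampled):

// Rough cost estimate for the k-means++ seeding loop above.
// Assumption: the candidate set handed to LocalKMeans is roughly the size
// of the full data set (20k points); the real count may be smaller.
val k = 4000L   // requested number of clusters (from the question)
val n = 20000L  // candidate points seen by LocalKMeans (assumption)
val d = 100L    // vector dimensionality

// Choosing center i rescans all n candidates against the i centers picked
// so far, so the loop performs on the order of n * k * (k - 1) / 2
// point-to-center distance evaluations, each costing roughly d operations.
val distanceEvals = n * k * (k - 1) / 2
println(s"~$distanceEvals distance evaluations of $d-dimensional vectors")
// ~1.6e11 evaluations * ~100 operations each is on the order of 1e13 operations
// on one driver thread, which is consistent with an initialization that takes
// hours, while random initialization skips this step entirely.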

Recommended answer

I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same: the problem is that the parallel KMeans initialization runs the algorithm in N parallel runs, but it is still extremely slow for a small amount of data. My solution is to continue using random initialization. Data size is approximately 4k clusters for 21k 100-dimensional vectors.
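
If you stick with MLlib's KMeans, the change suggested here is a single setter call: KMeans.RANDOM selects the random initialization mode instead of K_MEANS_PARALLEL. A minimal sketch, reusing clusterSize and data from the question's snippet:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// Same training setup as in the question, but with random initialization,
// which skips the expensive k-means|| / LocalKMeans seeding for large k.
val kmeans = new KMeans()
  .setInitializationMode(KMeans.RANDOM)
  .setK(clusterSize)
  .setRuns(1)
  .setMaxIterations(50)
  .setEpsilon(1e-4)

val clusterModel: KMeansModel = kmeans.run(data)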
