Apache Spark MLLib - Running KMeans with TF-IDF vectors - Java heap space


Question


I'm trying to run KMeans from MLLib on a (large) collection of text documents (TF-IDF vectors). The documents are sent through a Lucene English analyzer, and sparse vectors are created with the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always returns the OutOfMemory exception below. Any thoughts on how to tackle this issue?
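For reference, the setup looks roughly like the following sketch (names such as `tokenizedDocs` and the `k`/`maxIterations` values are placeholders; the real code feeds Lucene-analyzed token lists into `HashingTF`):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// tokenizedDocs: token lists produced by the Lucene English analyzer
def cluster(sc: SparkContext, tokenizedDocs: RDD[Seq[String]]): Unit = {
  val hashingTF = new HashingTF()              // default feature space: 2^20
  val tf: RDD[Vector] = hashingTF.transform(tokenizedDocs)
  tf.cache()

  val idf = new IDF().fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)

  // Lowering the parallelism via coalesce does not help here:
  // the OOM happens when KMeans densifies the candidate centers.
  val model = KMeans.train(tfidf.coalesce(8), 1000, 20)
}
```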

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
at breeze.linalg.Vector$class.toArray(Vector.scala:80)
at breeze.linalg.SparseVector.toArray(SparseVector.scala:48)
at breeze.linalg.Vector$class.toDenseVector(Vector.scala:75)
at breeze.linalg.SparseVector.toDenseVector(SparseVector.scala:48)
at breeze.linalg.Vector$class.toDenseVector$mcD$sp(Vector.scala:74)
at breeze.linalg.SparseVector.toDenseVector$mcD$sp(SparseVector.scala:48)
at org.apache.spark.mllib.clustering.BreezeVectorWithNorm.toDense(KMeans.scala:422)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$1.apply(KMeans.scala:285)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$1.apply(KMeans.scala:284)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:284)
at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)

Answer


After some investigation, it turns out that this issue is related to the new HashingTF().transform(v) method. Although creating sparse vectors with the hashing trick is really helpful (especially when the number of features is not known), the vectors must be kept sparse. The default size for HashingTF vectors is 2^20. Given 64-bit double precision, each vector would theoretically require 8 MB when converted to a dense vector, regardless of any dimensionality reduction we could apply.
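One way to keep the densified centers manageable is to cap the feature space when constructing HashingTF, at the cost of more hash collisions. A sketch; the choice of 2^18 is arbitrary, not from the original post:

```scala
import org.apache.spark.mllib.feature.HashingTF

// HashingTF(numFeatures: Int) shrinks the vector size from the default
// 1 << 20 down to something the dense cluster centers can tolerate:
// 2^18 features * 8 bytes = 2 MB per dense center instead of 8 MB.
val hashingTF = new HashingTF(numFeatures = 1 << 18)
```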


Sadly, KMeans uses the toDense method (at least for the cluster centers), which causes the OutOfMemory error (imagine with k = 1000):

  // From Spark's KMeans.scala: every sampled center is densified via toDenseVector
  private def initRandom(data: RDD[BreezeVectorWithNorm]) : Array[Array[BreezeVectorWithNorm]] = {
    val sample = data.takeSample(true, runs * k, new XORShiftRandom().nextInt()).toSeq
    Array.tabulate(runs)(r => sample.slice(r * k, (r + 1) * k).map { v =>
      new BreezeVectorWithNorm(v.vector.toDenseVector, v.norm)
    }.toArray)
  }
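To see why this densification blows the heap, the arithmetic can be spelled out (a back-of-the-envelope sketch that ignores JVM object overhead):

```scala
// Each dense center holds 2^20 doubles at 8 bytes each.
val featuresPerVector = 1 << 20                   // HashingTF default
val bytesPerDenseVector = featuresPerVector * 8L  // 8 MiB per center
val k = 1000                                      // cluster centers
val totalBytes = bytesPerDenseVector * k          // ~8 GB just for centers

println(s"per center: ${bytesPerDenseVector / (1L << 20)} MiB")
println(s"k = $k centers: ${totalBytes / (1L << 20)} MiB")
```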

