Apache Spark MLLib - Running KMeans with IDF-TF vectors - Java heap space


Problem description

I'm trying to run KMeans in MLlib on a (large) collection of text documents represented as TF-IDF vectors. Documents are sent through a Lucene English analyzer, and sparse vectors are created with the HashingTF.transform() function. Whatever degree of parallelism I use (set through the coalesce function), KMeans.train always throws the OutOfMemoryError below. Any thoughts on how to tackle this issue?

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
at breeze.linalg.Vector$class.toArray(Vector.scala:80)
at breeze.linalg.SparseVector.toArray(SparseVector.scala:48)
at breeze.linalg.Vector$class.toDenseVector(Vector.scala:75)
at breeze.linalg.SparseVector.toDenseVector(SparseVector.scala:48)
at breeze.linalg.Vector$class.toDenseVector$mcD$sp(Vector.scala:74)
at breeze.linalg.SparseVector.toDenseVector$mcD$sp(SparseVector.scala:48)
at org.apache.spark.mllib.clustering.BreezeVectorWithNorm.toDense(KMeans.scala:422)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$1.apply(KMeans.scala:285)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$1.apply(KMeans.scala:284)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:284)
at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)

Solution

After some investigation, it turns out that this issue is related to the new HashingTF().transform(v) method. Although creating sparse vectors with the hashing trick is really helpful (especially when the number of features is not known in advance), the vectors must be kept sparse. The default size for HashingTF vectors is 2^20. With 64-bit double precision, each vector would theoretically require 8 MB (2^20 × 8 bytes) once converted to a dense vector, regardless of any dimension reduction we could apply.
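The 8 MB figure is plain arithmetic, and it can be checked without Spark. The sketch below is only a back-of-the-envelope estimate: the object and method names are made up for illustration, and it ignores JVM object and array header overhead.

```scala
// Rough estimate only: counts the raw doubles, not JVM headers or padding.
object DenseVectorFootprint {
  val BytesPerDouble = 8L

  // Bytes needed to hold `count` dense vectors of `numFeatures` doubles each.
  def denseBytes(numFeatures: Long, count: Long = 1L): Long =
    numFeatures * BytesPerDouble * count

  def main(args: Array[String]): Unit = {
    val defaultFeatures = 1L << 20 // HashingTF default size cited above: 2^20

    val perVector = denseBytes(defaultFeatures)
    println(s"one dense vector: ${perVector / (1L << 20)} MiB")

    // Hypothetical count, to show how quickly dense copies add up.
    val forThousand = denseBytes(defaultFeatures, 1000L)
    println(f"1000 dense vectors: ${forThousand / math.pow(2, 30)}%.2f GiB")
  }
}
```

A sparse vector, by contrast, only stores its non-zero entries, so its size depends on the number of distinct hashed terms in the document rather than on the 2^20 dimensions.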

Sadly, KMeans uses the toDense method (at least for the cluster centers), which causes the OutOfMemory error (imagine k = 1000: roughly 8 GB for the dense centers).

  private def initRandom(data: RDD[BreezeVectorWithNorm]) : Array[Array[BreezeVectorWithNorm]] = {
    val sample = data.takeSample(true, runs * k, new XORShiftRandom().nextInt()).toSeq
    Array.tabulate(runs)(r => sample.slice(r * k, (r + 1) * k).map { v =>
      new BreezeVectorWithNorm(v.vector.toDenseVector, v.norm)
    }.toArray)
  }
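One mitigation that follows from this, though the answer above does not spell it out, is to give HashingTF a smaller feature space so that each dense copy stays small. The sketch below assumes the spark-mllib artifact is on the classpath; the object name, the helper, and the choice of 2^14 buckets are hypothetical and only meant as an illustration.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object SmallerHashingSpace {
  // Hypothetical choice for illustration: 2^14 buckets instead of the default 2^20.
  val NumFeatures: Int = 1 << 14

  // Hash an already-tokenized document into a term-frequency vector.
  def termFrequencies(tokens: Seq[String]): Vector =
    new HashingTF(NumFeatures).transform(tokens)

  def main(args: Array[String]): Unit = {
    val tf = termFrequencies(Seq("spark", "kmeans", "spark", "heap"))
    // A dense copy of this vector holds tf.size doubles.
    println(s"dimensions: ${tf.size}, dense bytes: ${tf.size.toLong * 8L}")
  }
}
```

The trade-off is that fewer buckets mean more hash collisions between distinct terms, so the right size depends on the vocabulary of the corpus and on how much memory the driver and executors have.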
