KMeans|| for sentiment analysis on Spark

Question
I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in 100-dimensional space, and now I'm trying to cluster that vector space. When I run KMeans with the default parallel initialization (KMeans||), the algorithm runs for 3 hours! But with the random initialization strategy it takes about 8 minutes. What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.
K ~= 4000, maxIterations was 20.
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.storage.StorageLevel

// Wrap each word2vec entry (word -> float array) into a labeled vector.
var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
  model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}", data.count())
log.info("==================Train process started==================")

val clusterSize = modelSize / 5
val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)
val time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
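For reference, a minimal sketch of how the trained model could be checked once kmeans.run returns; clusterModel, data, time, and log are the names from the snippet above, and the WSSSE check is illustrative, not part of the original program:

// Elapsed training time plus a basic quality metric for the fitted model.
val elapsedMs = System.currentTimeMillis() - time
log.info("Training took {} ms", elapsedMs)
// Within-set sum of squared errors over the training data.
val wssse = clusterModel.computeCost(data)
log.info("WSSSE = {}", wssse)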
And the Spark context initialization is here:
val conf = new SparkConf()
.setAppName("SparkPreProcessor")
.setMaster("local[4]")
.set("spark.default.parallelism", "8")
.set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)
Also, a few updates about running this program: I'm running it inside IntelliJ IDEA, and I don't have a real Spark cluster. But I thought that a personal machine can serve as a Spark cluster.
I saw that the program hangs inside this loop in the Spark code, LocalKMeans.scala:
// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
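A rough back-of-the-envelope shows why this loop dominates: iteration i scans every candidate point against i current centers, and each point/center comparison costs O(d) arithmetic, so the whole initialization does on the order of n * d * k^2 / 2 distance-component operations. Below is a minimal sketch of that estimate using the sizes from the question (for KMeans||, n here is really the sampled candidate set handed to LocalKMeans, which for k = 4000 can approach the full 21k points):

object KMeansPlusPlusCostEstimate {
  def main(args: Array[String]): Unit = {
    val n = 21000L // candidate points (upper bound: the full data set)
    val d = 100L   // vector dimension
    val k = 4000L  // requested clusters
    // Iteration i costs ~ n * i * d; summing i = 1..k gives n * d * k * (k + 1) / 2.
    val ops = n * d * k * (k + 1) / 2
    println(s"~$ops distance-component operations") // ~1.7e13, i.e. hours single-threaded
  }
}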
I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same. The problem is that KMeans|| initializes the algorithm across N parallel runs, yet it's still extremely slow for a small amount of data, so my solution is to continue using random initialization. Data size, approximately: 4k clusters for 21k 100-dim vectors.
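For anyone hitting the same wall, a minimal sketch of the random-initialization variant described above (the configuration that finished in about 8 minutes); data and clusterSize are the names from the training snippet earlier in the question:

// Same training call, but with random initialization instead of KMeans||.
val kmeansRandom = new KMeans()
  .setInitializationMode(KMeans.RANDOM)
  .setK(clusterSize)
  .setRuns(1)
  .setMaxIterations(50)
  .setEpsilon(1e-4)
val randomInitModel: KMeansModel = kmeansRandom.run(data)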