Spark MLlib FPGrowth job fails with Memory Error


Problem Description


I have a fairly simple use case, but potentially very large result set. My code does the following (on pyspark shell):

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    # do something with item
    pass


I find that whenever I kick off the actual processing by calling either count() or toLocalIterator, my operation ultimately ends with out of memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes up my memory? If yes, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?
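On the streaming question: `toLocalIterator` already pulls one partition at a time to the driver, but each partition must still fit in driver memory. In Spark, the way to avoid holding results on the driver at all is to write the RDD out directly, e.g. `model.freqItemsets().saveAsTextFile(path)`, which writes from the executors. The memory-bounded pattern itself can be sketched in plain Python (the generator below is a hypothetical stand-in for the result stream):

```python
# Minimal sketch: consume an iterator item-by-item and append each result
# to disk, so memory use is bounded by a single item, not the whole
# result set. This mirrors what saveAsTextFile does per executor.
import os
import tempfile

def stream_to_disk(items, path):
    """Write each item on its own line without materializing the sequence."""
    count = 0
    with open(path, "w") as f:
        for item in items:  # works for any iterator, e.g. toLocalIterator()
            f.write(f"{item}\n")
            count += 1
    return count

# Usage: a generator stands in for the (potentially huge) itemset stream.
path = os.path.join(tempfile.gettempdir(), "freq_itemsets.txt")
n = stream_to_disk((f"itemset-{i}" for i in range(1000)), path)
```

Note that this only sidesteps the driver-side memory issue; if the FP-tree itself cannot fit in executor memory, the job still fails during training.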

Thanks for any insights.

Edit: A fundamental limitation of FPGrowth is that the entire FP-tree has to fit in memory. So the suggestions about raising the minimum support threshold are valid.

-Raj

Recommended Answer


Well, the problem is most likely the support threshold. When you set a very low value like the one here (I wouldn't call one-in-a-million frequent), you essentially throw away all the benefits of the downward-closure property.


It means that the number of itemsets considered grows exponentially; in the worst case it will be equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
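A quick back-of-the-envelope check (plain Python, illustrative only) of how that candidate space explodes with the number of distinct items:

```python
# The number of non-empty itemsets over N distinct items is 2**N - 1.
# With no effective support threshold, all of them are potential candidates.
def candidate_itemsets(n_items):
    return 2 ** n_items - 1

for n in (10, 20, 50, 100):
    print(n, candidate_itemsets(n))
```

Even at 50 items the count is about 1.1e15, already far beyond anything that could be enumerated or held in memory.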

Edit


Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset that appears in your data has to be frequent. So basically what you're trying to do here is enumerate all observed itemsets.
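To make that concrete: Spark converts the relative `minSupport` into an absolute minimum occurrence count (roughly the ceiling of minSupport times the number of transactions). A rough check with the numbers from the question:

```python
import math

transactions = 200_000  # ~200K, from the comments in the question
min_support = 1e-6      # the threshold used in the code above

# The relative threshold translates into an absolute number of transactions
# an itemset must appear in before it counts as "frequent".
min_count = math.ceil(min_support * transactions)
print(min_count)  # 1: any itemset that appears even once is "frequent"
```

With a minimum count of 1, the algorithm degenerates into enumerating every itemset that occurs in the data at all, which is exactly the exponential blow-up described above.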

