Spark MLlib FPGrowth job fails with Memory Error

Question

I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    # do something with item
    pass

I find that whenever I kick off the actual processing by calling either count() or toLocalIterator, my operation ultimately ends with an out of memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes up my memory? If yes, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?
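
For reference, a minimal sketch of writing the result RDD out from the executors instead of iterating it on the driver (the output path is a made-up example, and this does not help if the FP tree itself cannot be built in memory):

# Sketch only: let the executors write frequent itemsets to disk instead of
# pulling them through the driver. "freq_itemsets_out" is a hypothetical path.
(model.freqItemsets()
      .map(lambda fi: "{}\t{}".format(",".join(fi.items), fi.freq))
      .saveAsTextFile("freq_itemsets_out"))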

Thanks for your insights.

A fundamental limitation of FPGrowth is that the entire FP Tree has to fit in memory. So, the suggestions about raising the minimum support threshold are valid.

- Raj

Answer

Well, the problem is most likely the support threshold. When you set a very low value like this (I wouldn't call one-in-a-million frequent), you basically throw away all the benefits of the downward-closure property.
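
Purely as an illustration (the values below are made up, not taken from the answer), raising the threshold looks like this; minSupport=0.01 keeps only itemsets that appear in at least 1% of the transactions:

# Illustrative values only: a much higher support threshold and fewer partitions.
model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=10)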

It means that the number of itemsets considered grows exponentially, and in the worst case it will be equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
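
To get a feel for that bound (an added illustration, not part of the original answer), even a modest number of distinct items blows up; with 40 items the worst case is already around 10^12 candidate itemsets:

n_items = 40             # hypothetical number of distinct items
print(2 ** n_items - 1)  # 1099511627775, i.e. roughly 1.1e12 candidate itemsets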

Edit:

Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset in your data has to be frequent. So basically what you're trying to do here is enumerate all observed itemsets.
