Spark MLlib FPGrowth job fails with Memory Error

Question

I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    # do something with item
    pass

I find that whenever I kick off the actual processing by calling either count() or toLocalIterator, my operation ultimately ends with an out of memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes up my memory? If yes, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?
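
For reference, a minimal sketch of writing the result RDD out from the executors instead of iterating it on the driver (the output path is a made-up example, and this does not help if the FP tree itself cannot be built in memory):

# Sketch only: let the executors write frequent itemsets to disk instead of
# pulling them through the driver. "freq_itemsets_out" is a hypothetical path.
(model.freqItemsets()
      .map(lambda fi: "{}\t{}".format(",".join(fi.items), fi.freq))
      .saveAsTextFile("freq_itemsets_out"))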

Thanks for your insights.

A fundamental limitation of FPGrowth is that the entire FP Tree has to fit in memory. So, the suggestions about raising the minimum support threshold are valid.

- Raj

Answer

Well, the problem is most likely the support threshold. When you set a very low value like this (I wouldn't call one-in-a-million frequent), you basically throw away all the benefits of the downward-closure property.
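
Purely as an illustration (the values below are made up, not taken from the answer), raising the threshold looks like this; minSupport=0.01 keeps only itemsets that appear in at least 1% of the transactions:

# Illustrative values only: a much higher support threshold and fewer partitions.
model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=10)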

It means that the number of itemsets considered grows exponentially, and in the worst case it will be equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
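
To get a feel for that bound (an added illustration, not part of the original answer), even a modest number of distinct items blows up; with 40 items the worst case is already around 10^12 candidate itemsets:

n_items = 40             # hypothetical number of distinct items
print(2 ** n_items - 1)  # 1099511627775, i.e. roughly 1.1e12 candidate itemsets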

Edit:

Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset in your data has to be frequent. So basically what you're trying to do here is enumerate all observed itemsets.
