Spark MLlib FPGrowth job fails with Memory Error


Problem Description


I have a fairly simple use case, but potentially very large result set. My code does the following (on pyspark shell):

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    # do something with item
    pass


I find that whenever I kick off the actual processing by calling either count() or toLocalIterator, my operation ultimately ends with out of memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes up my memory? If yes, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?
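On the streaming question: `toLocalIterator` already pulls one partition at a time to the driver, but each partition must still fit in driver memory. In Spark, the way to avoid holding results on the driver at all is to write the RDD out directly, e.g. `model.freqItemsets().saveAsTextFile(path)`, which writes from the executors. The memory-bounded pattern itself can be sketched in plain Python (the generator below is a hypothetical stand-in for the result stream):

```python
# Minimal sketch: consume an iterator item-by-item and append each result
# to disk, so memory use is bounded by a single item, not the whole
# result set. This mirrors what saveAsTextFile does per executor.
import os
import tempfile

def stream_to_disk(items, path):
    """Write each item on its own line without materializing the sequence."""
    count = 0
    with open(path, "w") as f:
        for item in items:  # works for any iterator, e.g. toLocalIterator()
            f.write(f"{item}\n")
            count += 1
    return count

# Usage: a generator stands in for the (potentially huge) itemset stream.
path = os.path.join(tempfile.gettempdir(), "freq_itemsets.txt")
n = stream_to_disk((f"itemset-{i}" for i in range(1000)), path)
```

Note that this only sidesteps the driver-side memory issue; if the FP-tree itself cannot fit in executor memory, the job still fails during training.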

Thanks for any insights.

Edit: A fundamental limitation of FPGrowth is that the entire FP-tree has to fit in memory. So the suggestions about raising the minimum support threshold are valid.

-Raj

Recommended Answer


Well, the problem is most likely the support threshold. When you set a very low value like the one here (I wouldn't call one-in-a-million frequent), you essentially throw away all the benefits of the downward-closure property.


It means that the number of itemsets considered grows exponentially; in the worst case it will be equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
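A quick back-of-the-envelope check (plain Python, illustrative only) of how that candidate space explodes with the number of distinct items:

```python
# The number of non-empty itemsets over N distinct items is 2**N - 1.
# With no effective support threshold, all of them are potential candidates.
def candidate_itemsets(n_items):
    return 2 ** n_items - 1

for n in (10, 20, 50, 100):
    print(n, candidate_itemsets(n))
```

Even at 50 items the count is about 1.1e15, already far beyond anything that could be enumerated or held in memory.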

Edit


Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset that appears in your data has to be frequent. So basically what you're trying to do here is enumerate all observed itemsets.
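To make that concrete: Spark converts the relative `minSupport` into an absolute minimum occurrence count (roughly the ceiling of minSupport times the number of transactions). A rough check with the numbers from the question:

```python
import math

transactions = 200_000  # ~200K, from the comments in the question
min_support = 1e-6      # the threshold used in the code above

# The relative threshold translates into an absolute number of transactions
# an itemset must appear in before it counts as "frequent".
min_count = math.ceil(min_support * transactions)
print(min_count)  # 1: any itemset that appears even once is "frequent"
```

With a minimum count of 1, the algorithm degenerates into enumerating every itemset that occurs in the data at all, which is exactly the exponential blow-up described above.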

