How to handle spill memory in Pig


Problem description



My code looks like this:

pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);

pymt_grp = GROUP pymt BY key;

results = FOREACH pymt_grp {

      /*
       *   some kind of logic, filter, count, distinct, sum, etc.
       */
}

But now I am finding many log entries like this:

org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 207012796 bytes from 1 objects. init = 5439488(5312K) used = 424200488(414258K) committed = 559284224(546176K) max = 559284224(546176K)

Actually I found the cause: the main reason is that there is a "hot" key, something like key=0 as an IP address, but I don't want to filter out this key. Is there any solution? I have already implemented the Algebraic and Accumulator interfaces in my UDF.

Solution

I had similar issues with heavily skewed data or with a DISTINCT nested in a FOREACH (since Pig does the distinct in memory). The solution was to take the DISTINCT out of the FOREACH; as an example, see my answer to How to optimize a group by statement in PIG latin?
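As a rough sketch of that idea, assuming a hypothetical ip column in the pymt relation (it is not named in the original post), the nested DISTINCT can be pulled out into a top-level DISTINCT so it runs as its own MapReduce step instead of being built up in reducer memory:

-- Sketch only: 'ip' is a hypothetical column used for illustration.
-- Instead of the nested form, which holds the distinct set in memory per group:
--   results = FOREACH pymt_grp {
--       ips = DISTINCT pymt.ip;
--       GENERATE group AS key, COUNT(ips) AS uniq_ips;
--   };
-- project and de-duplicate first, then group and count:
pairs = FOREACH pymt GENERATE key, ip;   -- keep only the columns that are needed
uniq = DISTINCT pairs;                   -- runs as its own job rather than in reducer memory
uniq_grp = GROUP uniq BY key;
uniq_cnt = FOREACH uniq_grp GENERATE group AS key, COUNT(uniq) AS uniq_ips;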

If you do not want to do a DISTINCT before your SUM and COUNT, then I would suggest using two GROUP BY statements. The first groups on the Key column plus another column or a random number mod 100, which acts as a salt (to spread the data of a single key across multiple reducers). The second GROUP BY groups on the Key column alone and calculates the final SUM of the group-1 COUNTs or sums.

Ex:

inpt = load '/data.csv' using PigStorage(',') as (Key, Value);
view = foreach inpt generate Key, Value, ((int)(RANDOM() * 100)) as Salt;

-- first pass: group on (Key, Salt) so a hot key is spread across reducers
group_1 = group view by (Key, Salt);
group_1_count = foreach group_1 generate group.Key as Key, COUNT(view) as count;

-- second pass: group on Key alone and sum up the partial counts
group_2 = group group_1_count by Key;
final_count = foreach group_2 generate flatten(group) as Key, SUM(group_1_count.count) as count;

