How to handle spill memory in Pig


Question

My code looks like this:

pymt = LOAD 'pymt' USING PigStorage('|') AS ($pymt_schema);

pymt_grp = GROUP pymt BY key;

results = FOREACH pymt_grp {

      /*
       *   some kind of logic, filter, count, distinct, sum, etc.
       */
}

But now I see many log lines like this:

org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 207012796 bytes from 1 objects. init = 5439488(5312K) used = 424200488(414258K) committed = 559284224(546176K) max = 559284224(546176K)

I have actually found the cause: in most cases there is a "hot" key, something like key=0 used as an IP address, but I don't want to filter this key out. Is there any solution? I have already implemented the Algebraic and Accumulator interfaces in my UDF.

Recommended answer

I had similar issues with heavily skewed data and with DISTINCT nested in a FOREACH (Pig performs a nested DISTINCT in memory). The solution was to take the DISTINCT out of the FOREACH; for an example, see my answer to How to optimize a group by statement in PIG latin?
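The linked answer is not reproduced here, so the following is only a minimal sketch of what taking a DISTINCT out of a FOREACH can look like. The `user_id` field and the two-column schema are hypothetical (the question loads its schema from `$pymt_schema`), and the two forms are assumed equivalent only when `user_id` is never null.

```pig
-- Hypothetical schema for illustration; the question uses $pymt_schema instead.
pymt = LOAD 'pymt' USING PigStorage('|') AS (key:chararray, user_id:chararray);

-- Nested form: the DISTINCT runs on each group's bag, so a hot key
-- pushes one very large bag through a single nested DISTINCT.
grp_nested = GROUP pymt BY key;
cnt_nested = FOREACH grp_nested {
    uniq_users = DISTINCT pymt.user_id;
    GENERATE group AS key, COUNT(uniq_users) AS users;
};

-- Rewritten form: DISTINCT is a top-level operator on (key, user_id) pairs,
-- so duplicates are removed before the GROUP builds its bags.
-- Assumption: user_id is never null (COUNT skips tuples whose first field is null).
pairs      = FOREACH pymt GENERATE key, user_id;
uniq_pairs = DISTINCT pairs;
grp_flat   = GROUP uniq_pairs BY key;
cnt_flat   = FOREACH grp_flat GENERATE group AS key, COUNT(uniq_pairs) AS users;
```

In the rewritten form the bag for the hot key holds only its unique pairs, which is usually much smaller than the raw bag when the data contains many duplicates.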

If you do not want to do the DISTINCT before your SUM and COUNT, then I would suggest using two GROUP BYs. The first one groups on the Key column plus another column, or plus a random number mod 100, which acts as a salt (it spreads the data of a single key across multiple reducers). The second GROUP BY is on the Key column alone and computes the final SUM of the COUNTs or SUMs produced by the first group.

For example:

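inpt = load '/data.csv' using PigStorage(',') as (Key, Value);
-- Salt: a random integer in 0..99 that spreads each Key over up to 100 groups.
view = foreach inpt generate Key, Value, ((int)(RANDOM() * 100)) as Salt;

-- First pass: partial count per (Key, Salt). The group key is the tuple `group`.
group_1 = group view by (Key, Salt);
group_1_count = foreach group_1 generate group.Key as Key, COUNT(view) as count;

-- Second pass: add up the partial counts per Key.
group_2 = group group_1_count by Key;
final_count = foreach group_2 generate flatten(group) as Key, SUM(group_1_count.count) as count;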
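The same two-pass pattern can carry a partial SUM alongside the partial COUNT. This is a sketch under assumptions rather than part of the original answer: `Value` is assumed to be numeric (the example above leaves the fields untyped), and the aliases `salted`, `partial`, and `totals` are made up for illustration.

```pig
-- Assumption: Value is numeric; the types below are not given in the source.
inpt   = LOAD '/data.csv' USING PigStorage(',') AS (Key:chararray, Value:double);
salted = FOREACH inpt GENERATE Key, Value, (int)(RANDOM() * 100) AS Salt;

-- Pass 1: partial sum and partial count per (Key, Salt).
stage_1 = GROUP salted BY (Key, Salt);
partial = FOREACH stage_1 GENERATE
              group.Key         AS Key,
              SUM(salted.Value) AS part_sum,
              COUNT(salted)     AS part_cnt;

-- Pass 2: combine the partial results per Key.
stage_2 = GROUP partial BY Key;
totals  = FOREACH stage_2 GENERATE
              group                 AS Key,
              SUM(partial.part_sum) AS total,
              SUM(partial.part_cnt) AS cnt;
```

This works because sums and counts can be combined from partial results. After the first pass each key has at most 100 rows, so the second GROUP BY is cheap even for the hot key.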
