How do you improve performance on a Pig job that has very skewed data?

Question

I am running a Pig script that performs a GROUP BY followed by a nested FOREACH. It takes hours to run because of one or two slow reduce tasks. For example:

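-- Group on the composite key; PARALLEL sets the number of reduce tasks.
B = GROUP A BY (fld1, fld2) PARALLEL 50;

-- For each group, count the distinct fld1 values in its bag.
C = FOREACH B {
   U = A.fld1;
   DIST = DISTINCT U;
   GENERATE FLATTEN(group), COUNT_STAR(DIST);
}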

When I examined the counters for the slow tasks, I saw that those two reducers process far more data than the others. My understanding is that the data is very skewed, so the "slow" tasks really are doing more work than the fast ones. How can I improve performance? I would rather not increase the parallelism just to split up the work, but is that the only way?

Recommended answer

The first option is to use a custom partitioner. See the documentation on GROUP for more info, specifically PARTITION BY. Unfortunately, you will probably have to write the partitioner yourself. In it, send the first huge key to reducer 0, send the next huge key to reducer 1, and then apply standard hash partitioning to the remaining keys across the reducers that are left. Each of the big keys then gets a reducer to itself, while the other reducers each handle many smaller keys. This doesn't always solve the problem when the skew is bad, though, because all records for a single key still land on one reducer.
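As a rough sketch of how such a partitioner might be wired into the script from the question: the jar name `skew-partitioner.jar` and the class `com.example.SkewAwarePartitioner` are hypothetical placeholders, not something from the original post. The class itself would have to be written in Java as a Hadoop partitioner and is not shown here.

```pig
-- Hypothetical jar containing the custom partitioner class (name is an assumption).
REGISTER skew-partitioner.jar;

-- PARTITION BY names the partitioner class; PARALLEL still sets the reducer count.
-- com.example.SkewAwarePartitioner is a placeholder for a class you write yourself.
B = GROUP A BY (fld1, fld2)
    PARTITION BY com.example.SkewAwarePartitioner
    PARALLEL 50;
```

The rest of the script (the nested FOREACH) would stay unchanged, since only the routing of keys to reducers differs.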

How valuable is the count for those two huge groups? I often see huge skew caused by values like NULL or the empty string. If those counts aren't that valuable, filter the records out before the GROUP BY.
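A minimal sketch of that filter, assuming the skewed keys really are NULL or empty strings and that `fld1` and `fld2` are chararray fields (neither is confirmed in the question). The alias `A_clean` is introduced here only for illustration.

```pig
-- Assumption: fld1 and fld2 are chararray, and the oversized groups are
-- the ones whose key fields are NULL or ''.
A_clean = FILTER A BY fld1 IS NOT NULL AND fld1 != ''
                  AND fld2 IS NOT NULL AND fld2 != '';

-- Group the filtered relation instead of A.
B = GROUP A_clean BY (fld1, fld2) PARALLEL 50;
```

The nested FOREACH would then need to reference `A_clean` rather than `A` inside the block (for example `U = A_clean.fld1;`), because the bag in each group is named after the grouped relation.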
