How do you improve performance on a Pig job that has very skewed data?


Question

I am running a Pig script that performs a GROUP BY followed by a nested FOREACH. It takes hours to run because of one or two slow reduce tasks. For example:

B = GROUP A BY (fld1, fld2) parallel 50;

C = FOREACH B {
   U = A.fld1;
   DIST = DISTINCT U;
   GENERATE FLATTEN(group), COUNT_STAR(DIST);
}

After examining the counters for the slow tasks, I realized that two reducers appear to be processing far more data than the other tasks. My understanding is that the data is very skewed, so the "slow" tasks are in fact doing more work than the fast ones. How can I improve performance? I would hate to increase the parallelism just to split up the work, but is that the only way?

Recommended answer

The first option is to use a custom partitioner. See the Pig documentation on GROUP for more information, specifically the PARTITION BY clause. Unfortunately, you will probably have to write your own partitioner here. In it, send the first huge set of keys to reducer 0 and the next set to reducer 1, then apply standard hash partitioning across what's left. This lets one reducer handle each of the big sets exclusively, while the other reducers each get multiple sets of keys. It doesn't always solve bad skew, though.
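As a rough sketch of how the PARTITION BY clause would plug into the GROUP from the question: the jar name and the class `com.example.SkewAwarePartitioner` below are hypothetical placeholders, not something that ships with Pig. The class would have to be written separately in Java, extend `org.apache.hadoop.mapreduce.Partitioner`, and implement the routing described above.

```pig
-- Hypothetical jar containing the custom partitioner class (assumption).
REGISTER skew-partitioner.jar;

-- Same grouping as in the question, but reducer assignment is delegated
-- to the (hypothetical) partitioner instead of the default hash partitioner.
B = GROUP A BY (fld1, fld2)
    PARTITION BY com.example.SkewAwarePartitioner
    PARALLEL 50;
```

The PARALLEL value still sets the number of reducers; the partitioner only decides which of those reducers each key goes to.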

How valuable is the count for those two huge sets of data? I often see huge skew when the key is something like NULL or an empty string. If those counts aren't that valuable, filter the rows out before the GROUP BY.
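A minimal sketch of that filter, assuming the skewed keys really are NULL or empty values of `fld1` and `fld2`, and that both fields are chararrays. Neither assumption is confirmed by the question, so check which keys are actually heavy first. The alias `A_clean` is made up for illustration.

```pig
-- Drop rows whose grouping fields are NULL or empty (assumed to be the hot keys).
A_clean = FILTER A BY fld1 IS NOT NULL AND fld1 != ''
                  AND fld2 IS NOT NULL AND fld2 != '';

-- Group the filtered relation; the hot keys never reach the reducers.
B = GROUP A_clean BY (fld1, fld2) PARALLEL 50;
```

Inside the nested FOREACH, the bag would then be referenced as `A_clean.fld1` rather than `A.fld1`.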
