Partial aggregation vs. combiners: which one is faster?

Question

There is a note about how Cascading/Scalding optimize map-side evaluation: they use so-called partial aggregation. Is it actually a better approach than combiners? Are there any performance comparisons on common Hadoop tasks (word count, for example)? If so, will Hadoop support this in the future?

Recommended answer

In practice, partial aggregation offers more benefits than the use of combiners.

The cases where combiners are useful are limited. Also, combiners optimize the amount of throughput required by the tasks, not the number of reduces. That is a subtle distinction, but it adds up to significant performance differences.
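
To make the idea concrete, here is a minimal sketch of map-side partial aggregation for word count, written in plain Python rather than against the Cascading or Hadoop APIs. It assumes a bounded in-memory cache with least-recently-used eviction; the function names and the `threshold` parameter are hypothetical and chosen only for illustration. Counts are accumulated before anything is emitted, and a partial count leaves the mapper only when its key is evicted or the input ends.

```python
from collections import OrderedDict


def map_side_partial_count(words, threshold=3):
    """Yield (word, partial_count) pairs using a bounded LRU cache.

    `threshold` is a hypothetical cap on the number of distinct keys
    held in memory at once.
    """
    cache = OrderedDict()
    for word in words:
        if word in cache:
            cache[word] += 1
            cache.move_to_end(word)  # mark as most recently used
        else:
            if len(cache) >= threshold:
                # Cache is full: evict the least recently used key
                # and emit its partial count downstream.
                yield cache.popitem(last=False)
            cache[word] = 1
    # Flush whatever is still cached at the end of the input.
    yield from cache.items()


def reduce_counts(pairs):
    """Reduce side: sum the partial counts for each word."""
    totals = {}
    for word, n in pairs:
        totals[word] = totals.get(word, 0) + n
    return totals


words = "a b a c a b d a".split()
partials = list(map_side_partial_count(words, threshold=2))
totals = reduce_counts(partials)
print(len(words), "input records ->", len(partials), "emitted partials")
print(totals)
```

Even with a cache of only two keys, fewer records are emitted than were read, and the reduce side still arrives at the same totals. A larger cache, or input with more key locality, would shrink the emitted set further.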

There is a much broader range of use cases for partial aggregation in large distributed workflows. Also, partial aggregation can be used to optimize the number of job steps required for a workflow.

Examples are shown in https://github.com/Cascading/Impatient/wiki/Part-5 which uses the CountBy and SumBy partial aggregates. If you look back through the commit history on GitHub for that project, it previously used GroupBy and Count, which resulted in more reduces.
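
Counting and summing lend themselves to partial aggregation because both operations are associative and commutative, so partial results computed per input split can be merged in any order. The following is a rough sketch of that property in plain Python, not the Cascading CountBy or SumBy classes themselves; the function names and sample data are made up for illustration.

```python
from collections import defaultdict


def partial_count_and_sum(records):
    """One map-side pass: per-key (count, sum) partials for a single split."""
    partials = defaultdict(lambda: [0, 0])
    for key, value in records:
        partials[key][0] += 1      # count-style partial
        partials[key][1] += value  # sum-style partial
    return {k: tuple(v) for k, v in partials.items()}


def merge_partials(partials_per_split):
    """Reduce side: merge the (count, sum) partials from every split."""
    merged = defaultdict(lambda: [0, 0])
    for partials in partials_per_split:
        for key, (count, total) in partials.items():
            merged[key][0] += count
            merged[key][1] += total
    return {k: tuple(v) for k, v in merged.items()}


# Two hypothetical input splits of (key, value) records.
split_1 = [("x", 2), ("y", 5), ("x", 3)]
split_2 = [("x", 1), ("y", 4)]

result = merge_partials(
    [partial_count_and_sum(split_1), partial_count_and_sum(split_2)]
)
print(result)
```

Each split sends at most one record per key to the reduce side, carrying both the count and the sum, rather than one record per input row.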
