Join vs COGROUP in PIG


Question

Are there any advantages (with respect to performance or the number of MapReduce jobs) when I use COGROUP instead of JOIN in Pig?

http://developer.yahoo.com/hadoop/tutorial/module6.html talks about the difference in the type of output they produce. But ignoring the output schema, is there any significant difference in performance?

Recommended answer

There are no major performance differences. The reason I say this is that both end up as a single MapReduce job that sends the same data forward to the reducers. Both need to send all of the records forward, keyed by the foreign key. If anything, COGROUP might be a bit faster because it does not do the cartesian product across the hits and instead keeps them in separate bags.
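To make the "separate bags versus cartesian product" point concrete, here is a minimal Python sketch of what the reduce side conceptually produces for each operation. It is a plain in-memory simulation, not Pig code; the function names `cogroup` and `inner_join` and the sample `users` and `orders` data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key=lambda rec: rec[0]):
    """Group both inputs by key, keeping each input's records in its own bag."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[key(rec)][0].append(rec)
    for rec in right:
        groups[key(rec)][1].append(rec)
    return dict(groups)


def inner_join(left, right, key=lambda rec: rec[0]):
    """Same grouping, followed by a cross product of the two bags per key."""
    joined = []
    for left_bag, right_bag in cogroup(left, right, key).values():
        # An empty bag yields no pairs, which gives inner-join behaviour.
        joined.extend(l + r for l, r in product(left_bag, right_bag))
    return joined


# Hypothetical sample data: (id, value) tuples.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]

grouped = cogroup(users, orders)
joined = inner_join(users, orders)
```

In this sketch both operations receive the same records grouped by the same key; the join only adds the cross-product step on top, which is consistent with the answer's claim that the shuffle cost is the same and COGROUP does slightly less work afterwards.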

If one of your data sets is small, you can use a join option called "replicated join". This will distribute the second data set across all map tasks and load it into main memory. This way, it can do the entire join in the mapper and does not need a reducer. In my experience, this is well worth it because the bottleneck in joins and cogroups is the shuffling of the entire data set to the reducer. You can't do this with COGROUP, to my knowledge.
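In Pig Latin, the option is requested with `USING 'replicated'` on the JOIN statement, for example `C = JOIN big BY id, small BY id USING 'replicated';` (the relation names here are placeholders). The idea behind it can be sketched in Python as a map-side hash join: every map task gets a full in-memory copy of the small data set and joins its own split of the large one without any shuffle. The helper names `build_lookup` and `map_side_join` and the sample data are assumptions made for this illustration, not part of Pig.

```python
from collections import defaultdict


def build_lookup(small, key=lambda rec: rec[0]):
    """Load the small data set into an in-memory hash table keyed by join key."""
    table = defaultdict(list)
    for rec in small:
        table[key(rec)].append(rec)
    return table


def map_side_join(split, lookup, key=lambda rec: rec[0]):
    """Join one input split of the large data set against the in-memory table."""
    for rec in split:
        # .get avoids inserting empty entries for keys with no match.
        for match in lookup.get(key(rec), []):
            yield rec + match


# Hypothetical sample data: the small relation fits in memory,
# the large relation arrives as independent splits (one per map task).
small = [(1, "alice"), (2, "bob")]
big_splits = [
    [(1, "book"), (3, "lamp")],
    [(2, "pen"), (1, "ink")],
]

lookup = build_lookup(small)  # replicated to every "mapper"
results = [list(map_side_join(split, lookup)) for split in big_splits]
```

Each split is processed independently, so no records have to be sent to a reducer; the trade-off is that the small data set must fit in the memory of every map task.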
