Hadoop handling data skew in reducer


Problem Description

I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0 MRv1) to handle data skew at a reducer.
Scenario: I have a custom composite key and partitioner in place to route data to the reducers. To deal with the odd but very likely case of a million keys and large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned to spawn off new reducers.
I am considering a two-step process:


  1. Set mapred.max.reduce.failures.percent to, say, 10% and let the job complete.

  2. Rerun the job on the failed data set, passing a configuration through the driver that causes my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface (a sketch of this follows the list).
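
A minimal sketch of what step 2 could look like, using the new-API org.apache.hadoop.mapreduce.Partitioner for illustration. The property names skew.rerun and skew.keys, the Text key/value types, and the class name SkewAwarePartitioner are assumptions for illustration, not from the original post:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // On the rerun, scatter a driver-supplied list of skewed keys across
    // random reducers; all other keys keep their normal hash partition.
    public class SkewAwarePartitioner extends Partitioner<Text, Text>
            implements Configurable {

        private Configuration conf;
        private boolean rerun;                         // true only on the second pass
        private final Set<String> skewedKeys = new HashSet<>();
        private final Random random = new Random();

        @Override
        public void setConf(Configuration conf) {      // called by the framework
            this.conf = conf;
            rerun = conf.getBoolean("skew.rerun", false);
            skewedKeys.addAll(Arrays.asList(conf.getStrings("skew.keys", new String[0])));
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (rerun && skewedKeys.contains(key.toString())) {
                return random.nextInt(numPartitions);  // break the skewed key apart
            }
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

On the rerun the driver would then set something like conf.setBoolean("skew.rerun", true) and conf.setStrings("skew.keys", ...) before submitting the job. Note that scattering a key this way deliberately gives up the guarantee that all values for that key reach a single reducer, which is the whole point of the rerun.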

Is there a better way / another way?



A possible counter-solution may be to write out the mappers' output and spin off another map job that does the work of the reducer, but I do not want to put pressure on the NameNode.

Solution

This idea comes to mind; I am not sure how good it is.

Let's say you are currently running the job with 10 mappers and it is failing because of the data skew. The idea is that you set the number of reducers to 15 and also define the maximum number of (key, value) pairs that may go to one reducer from each mapper. You keep that information in a hash map in your custom partitioner class. Once a particular reducer reaches that limit, you start sending the next set of (key, value) pairs to another reducer from the extra 5 reducers kept for handling the skew, as sketched below.
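
A minimal sketch of that partitioner, assuming 15 reducers total with 10 primary ones and an illustrative per-mapper cap. The property names skew.primary.reducers and skew.max.per.reducer, the round-robin choice of overflow reducer, the Text types, and the class name CappedPartitioner are all assumptions:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Cap how many (key, value) pairs this map task sends to each of the
    // primary reducers, and divert the overflow round-robin onto the extra
    // reducers (e.g. partitions 10..14 of a 15-reducer job).
    public class CappedPartitioner extends Partitioner<Text, Text>
            implements Configurable {

        private Configuration conf;
        private int primaryReducers;   // e.g. 10
        private long maxPerReducer;    // per-mapper cap for each primary reducer
        // Pairs sent so far by this map task, per primary reducer.
        private final Map<Integer, Long> sentCounts = new HashMap<>();
        private int overflowCursor = 0;

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
            primaryReducers = conf.getInt("skew.primary.reducers", 10);
            maxPerReducer = conf.getLong("skew.max.per.reducer", 1_000_000L);
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int target = (key.hashCode() & Integer.MAX_VALUE) % primaryReducers;
            long sent = sentCounts.merge(target, 1L, Long::sum);
            int extras = numPartitions - primaryReducers;
            if (sent <= maxPerReducer || extras <= 0) {
                return target;
            }
            // Over the cap: spill onto one of the extra reducers.
            return primaryReducers + (overflowCursor++ % extras);
        }
    }

Note that the counts are local to one map task, since each mapper gets its own partitioner instance, which matches the "from each mapper" wording above. Also, once a key spills over the cap, its values are split across two reducers, so the reduce-side logic has to tolerate seeing a key on more than one reducer.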


