Hadoop: handling data skew in a reducer

Problem description

I am trying to determine whether the Hadoop API (Hadoop 2.0.0, MRv1) offers any hooks for handling data skew in a reducer. Scenario: I have a custom composite key and a partitioner in place to route data to reducers. To deal with the unusual but quite plausible case of a million keys with large values ending up on the same reducer, I need some sort of heuristic so that this data can be partitioned further and spread across additional reducers. I am thinking of a two-step process:

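  1. Set mapred.max.reduce.failures.percent to, say, 10% and let the job complete.
  2. Rerun the job on the failed data set, passing a configuration through the driver that causes my partitioner to partition the skewed data randomly. The partitioner will implement the Configurable interface.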
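
The second step could look roughly like the sketch below. It is plain Java with no Hadoop dependency, so it only illustrates the logic; in a real job the same code would sit in a class extending Hadoop's `Partitioner` and implementing `Configurable`. The class name `SkewSaltingPartitioner`, the configuration key `skew.scatter.enabled`, and the use of a `Map` in place of a Hadoop `Configuration` are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch only: stands in for a Hadoop Partitioner that is also Configurable.
class SkewSaltingPartitioner {
    // Hypothetical configuration key, not a built-in Hadoop property.
    static final String SCATTER_KEY = "skew.scatter.enabled";

    private final boolean scatter;
    private final Random random;

    // The Map plays the role of the job Configuration set by the driver.
    SkewSaltingPartitioner(Map<String, String> conf, long seed) {
        String flag = conf.get(SCATTER_KEY);
        this.scatter = Boolean.parseBoolean(flag == null ? "false" : flag);
        this.random = new Random(seed);
    }

    int getPartition(String key, int numPartitions) {
        if (scatter) {
            // Rerun on the failed data: ignore the key and spread records randomly.
            return random.nextInt(numPartitions);
        }
        // Normal run: hash partitioning, masking the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SCATTER_KEY, "true");
        SkewSaltingPartitioner p = new SkewSaltingPartitioner(conf, 42L);
        for (int i = 0; i < 5; i++) {
            System.out.println("hotKey -> reducer " + p.getPartition("hotKey", 4));
        }
    }
}
```

One caveat with this approach: once records are scattered randomly, values for the same key no longer meet in a single reducer, so the rerun would need a follow-up step that merges the partial results per key.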

Is there a better way, or another way?

A possible counter-solution would be to write out the mappers' output and spin off another map-only job that does the reducers' work, but I do not want to put pressure on the NameNode.

Recommended answer

This idea comes to mind, though I am not sure how good it is.

Let's say you are currently running the job with 10 reducers, and it is failing because of the data skew. The idea is to set the number of reducers to 15 and also define the maximum number of (key, value) pairs that each mapper may send to any one reducer. You keep those counts in a hash map inside your custom partitioner class. Once a particular reducer reaches the limit, you start sending the next (key, value) pairs to one of the 5 extra reducers that were set aside for handling the skew.
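
A minimal sketch of that counting logic, again in plain Java rather than against the Hadoop API. The class name `OverflowPartitioner`, its constructor parameters, and the round-robin choice among the spare reducers are assumptions; the answer does not say how the extra reducer should be picked.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the per-mapper counting idea, without Hadoop types.
class OverflowPartitioner {
    private final int baseReducers;   // reducers used for normal hash partitioning, e.g. 10
    private final int spareReducers;  // extra reducers reserved for overflow, e.g. 5
    private final long maxPerReducer; // records this mapper may send to one base reducer
    private final Map<Integer, Long> sent = new HashMap<>();
    private int nextSpare = 0;

    OverflowPartitioner(int baseReducers, int spareReducers, long maxPerReducer) {
        this.baseReducers = baseReducers;
        this.spareReducers = spareReducers;
        this.maxPerReducer = maxPerReducer;
    }

    // Returns a partition in [0, baseReducers + spareReducers).
    int getPartition(String key) {
        int base = (key.hashCode() & Integer.MAX_VALUE) % baseReducers;
        long count = sent.merge(base, 1L, Long::sum);
        if (count <= maxPerReducer) {
            return base; // still under the limit for this reducer
        }
        // Over the limit: hand the record to a spare reducer, round-robin (assumption).
        int spare = baseReducers + nextSpare;
        nextSpare = (nextSpare + 1) % spareReducers;
        return spare;
    }

    public static void main(String[] args) {
        OverflowPartitioner p = new OverflowPartitioner(10, 5, 3);
        for (int i = 0; i < 6; i++) {
            System.out.println("hotKey -> reducer " + p.getPartition("hotKey"));
        }
    }
}
```

Each map task would hold its own instance, so the limit applies per mapper, as the answer describes. Records for a hot key end up on several reducers, so whatever the reducers compute for that key has to be combined afterwards.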
