Hadoop Reducer: How can I output to multiple directories using speculative execution?


Question


I have a reducer that needs to output results to different directories so that we can later use the output as input to Hive as a partitioned table. (Hive creates partitions based on folder names.) To write to these locations, we are currently not using any Hadoop framework facility; we just write out to separate locations "behind Hadoop's back", so to speak. In other words, we are not using Hadoop's API to output these files.

We had issues with mapred.reduce.tasks.speculative.execution set to true. I understand why: multiple attempts for the same task end up writing to the same location.
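For reference, here is a minimal sketch of how that flag gets set on our jobs (old-style property name):

    import org.apache.hadoop.conf.Configuration;

    // Enable speculative execution for reduce tasks
    // (old mapred-API property name).
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);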

Is there a way to use Hadoop's API to correctly output to several different folders from the same reducer, so that I can also keep mapred.reduce.tasks.speculative.execution=true? (I know about MultipleOutputs, but I'm not sure whether it supports speculative execution.)
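For what it's worth, the kind of thing I'd hope to write with MultipleOutputs looks roughly like this (a sketch only; the class name and partition layout are made up, and I don't know how it behaves under speculative execution):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Illustrative reducer: keys are Hive partition names like "dt=2012-01-01".
    public class PartitionReducer extends Reducer<Text, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void reduce(Text partition, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // baseOutputPath is resolved relative to the job output directory,
                // so "dt=2012-01-01/part" becomes <output>/dt=2012-01-01/part-r-00000.
                mos.write(NullWritable.get(), value, partition.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }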

If there is such a way, can it also output to S3?

Solution

The way Hadoop typically deals with speculative execution is to create an output folder for each task attempt (in a _temporary subfolder of the actual HDFS output directory).

The OutputCommitter for the OutputFormat then simply moves the contents of the temp task folder to the actual output folder when a task succeeds, and deletes the temp task folders of attempts that failed or were aborted (this is the default behavior for most FileOutputFormats).
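Here's a minimal sketch of leaning on that behavior from inside a reducer, assuming the default FileOutputCommitter (the reducer class and the partition folder names are illustrative): anything written under the task attempt's work path is promoted to the real output folder only if that attempt commits.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SideFileReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Resolves to this attempt's private folder, e.g.
            // <output>/_temporary/_attempt_..._r_000002_0/
            Path workPath = FileOutputFormat.getWorkOutputPath(context);

            // Hive-style partition folder under the work path (illustrative name).
            Path sideFile = new Path(workPath, "dt=" + key + "/data");
            FileSystem fs = sideFile.getFileSystem(context.getConfiguration());

            FSDataOutputStream out = fs.create(sideFile);
            try {
                for (Text value : values) {
                    out.writeBytes(value + "\n");
                }
            } finally {
                out.close();
            }
            // On commit, the committer moves dt=.../data into the real output
            // folder; losing speculative attempts are simply deleted.
        }
    }

Because the whole attempt folder is moved recursively on commit, the dt=... subfolders land directly under the job output directory, which is the layout Hive expects.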

So for your case, if you are writing to a folder outside of the job output folder, you'll need to extend or implement your own output committer. I'd follow the same principles when creating the files: include the full task ID (including the attempt ID) to avoid name collisions under speculative execution. How you track the files created in your job and manage their deletion in the abort/fail scenarios is up to you (maybe some file globbing for the task IDs?).
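To make that concrete, here is a hedged skeleton of such a committer, assuming the reducers stage their external files under a per-attempt folder (the externalRoot path, the final location, and the class name are all illustrative, not a drop-in implementation):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

    // Illustrative committer: tasks stage external output under
    // <externalRoot>/<full task attempt id>/ and only the committed
    // attempt's staging folder is promoted.
    public class ExternalDirCommitter extends FileOutputCommitter {

        private final Path externalRoot = new Path("s3n://my-bucket/staging"); // assumption

        public ExternalDirCommitter(Path outputPath, TaskAttemptContext context)
                throws IOException {
            super(outputPath, context);
        }

        @Override
        public void commitTask(TaskAttemptContext context) throws IOException {
            super.commitTask(context);
            String attempt = context.getTaskAttemptID().toString();
            FileSystem fs = externalRoot.getFileSystem(context.getConfiguration());
            Path staged = new Path(externalRoot, attempt);
            if (fs.exists(staged)) {
                // Promote the winning attempt's files (illustrative final location).
                fs.rename(staged, new Path("s3n://my-bucket/final/" + attempt));
            }
        }

        @Override
        public void abortTask(TaskAttemptContext context) throws IOException {
            super.abortTask(context);
            // Discard whatever a failed/killed attempt staged.
            String attempt = context.getTaskAttemptID().toString();
            FileSystem fs = externalRoot.getFileSystem(context.getConfiguration());
            fs.delete(new Path(externalRoot, attempt), true);
        }
    }

You would return this committer from your OutputFormat's getOutputCommitter(). Keep in mind that on S3 a rename is really a copy, so the promotion step is not atomic there.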
