Who will get a chance to execute first, Combiner or Partitioner?

Question

I'm confused after reading the following passage from Hadoop: The Definitive Guide, 4th edition (page 204):

  • Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.

Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Here are my doubts:

1) Which runs first, the combiner or the partitioner?

2) When both a custom combiner and a custom partitioner are set, in what order are the steps executed?

3) Can we feed compressed data (Avro, SequenceFile, etc.) to a custom combiner? If yes, how?

Looking for a brief but in-depth explanation.

Thanks in advance.

Recommended answer

1/ The answer is already given in this part: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

So the partitions are created first, in memory. Within each partition the data is sorted by key, and if a combiner is specified, it is then executed in memory on the sorted output. Finally, the result is spilled to disk. In other words, the partitioner runs before the combiner.
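The order described above (partition, then sort, then combine, then spill) can be sketched in plain Java. This is a simplified simulation and not Hadoop code: the class `SpillOrderSketch` and its methods are hypothetical names, the mapper is assumed to emit `(word, 1)` pairs, and the combiner is assumed to sum the values per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified, Hadoop-free simulation of one map-side spill (hypothetical names).
final class SpillOrderSketch {

    // Step 1: choose a partition for a key (hash of the key modulo reducer count).
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Takes the words a mapper would emit as (word, 1) and returns, per partition,
    // the sorted and combined (word, count) pairs that would be spilled to disk.
    static List<TreeMap<String, Integer>> spill(String[] words, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }

        for (String word : words) {
            // Step 1: the partitioner is consulted first, once per record.
            int p = partitionFor(word, numReducers);
            // Step 2: the TreeMap keeps keys sorted within the partition.
            partitions.get(p).computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }

        // Step 3: the combiner runs on the sorted output of each partition.
        List<TreeMap<String, Integer>> spilled = new ArrayList<>();
        for (TreeMap<String, List<Integer>> partition : partitions) {
            TreeMap<String, Integer> combined = new TreeMap<>();
            partition.forEach((key, values) -> {
                int sum = 0;
                for (int v : values) {
                    sum += v;
                }
                combined.put(key, sum);
            });
            // Step 4: "combined" is what would be written to the spill file.
            spilled.add(combined);
        }
        return spilled;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Integer>> spilled = spill(new String[] {"b", "a", "b", "c"}, 2);
        for (int p = 0; p < spilled.size(); p++) {
            System.out.println("partition " + p + ": " + spilled.get(p));
        }
        // prints:
        // partition 0: {b=2}
        // partition 1: {a=1, c=1}
    }
}
```

Note how the two `b` records are merged into one before anything is "written", which is the reduction in spilled data that the quoted passage refers to.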

2/ A custom combiner and a custom partitioner are used when they are specified in the driver class:

job.setCombinerClass(MyCombiner.class);
job.setPartitionerClass(MyPartitioner.class);

If no combiner is specified, no combiner is executed. If no custom partitioner is specified, the default partitioner, HashPartitioner, is used (see page 221).
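To make the difference between the default and a custom partitioner concrete, here is a minimal plain-Java sketch. `SimplePartitioner` is a local stand-in that only mirrors the shape of Hadoop's `getPartition(key, value, numPartitions)` method; it is not the Hadoop class. The hash rule follows the formula used by the default HashPartitioner, while `BY_FIRST_LETTER` is a purely hypothetical custom rule.

```java
import java.util.Arrays;

// Local stand-in mirroring the shape of Hadoop's Partitioner#getPartition (assumption: not the real class).
interface SimplePartitioner<K, V> {
    int getPartition(K key, V value, int numPartitions);
}

final class PartitionerSketch {

    // Default behaviour: clear the sign bit of the hash, then take the remainder.
    static final SimplePartitioner<String, Integer> HASH =
        (key, value, n) -> (key.hashCode() & Integer.MAX_VALUE) % n;

    // Hypothetical custom rule: keys starting with a..m go to partition 0,
    // every other key (including the empty key) goes to the last partition.
    static final SimplePartitioner<String, Integer> BY_FIRST_LETTER =
        (key, value, n) -> {
            if (key.isEmpty()) {
                return n - 1;
            }
            char c = Character.toLowerCase(key.charAt(0));
            return (c >= 'a' && c <= 'm') ? 0 : n - 1;
        };

    public static void main(String[] args) {
        for (String key : Arrays.asList("apple", "zebra")) {
            System.out.println(key
                + " hash=" + HASH.getPartition(key, 1, 4)
                + " custom=" + BY_FIRST_LETTER.getPartition(key, 1, 4));
        }
    }
}
```

Whichever rule is used, it only decides which reducer a key is sent to. The combiner still runs afterwards, inside each partition.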

3/ Yes, it is possible. Don't forget that the combiner uses the same mechanism as the reducer: it works on key/value records, so it never has to deal with the compression itself. If the job consumes compressed data, that means the input file format is compressed, and it is the input format that reads it. For that, you can specify the input format in the driver class:

Sequence File case: job.setInputFormatClass(SequenceFileInputFormat.class);
Avro File case: job.setInputFormatClass(AvroKeyInputFormat.class); 
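The idea that the reading side handles the compression, so the combiner only ever sees plain records, can be illustrated with a small stand-alone analogy. This is not Hadoop code: gzip from the JDK stands in for a compressed input format, each line is treated as a key with an implicit count of 1, and `CompressedInputSketch` and its methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Analogy only: the reader decompresses, the combining step sees ordinary records.
final class CompressedInputSketch {

    // Produces gzip-compressed bytes, standing in for a compressed input file.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // The reader (input format role) decompresses and yields one record per line;
    // the combining step (combiner role) just sums counts per key.
    static TreeMap<String, Integer> readAndCombine(byte[] compressed) {
        TreeMap<String, Integer> combined = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                combined.merge(line, 1, Integer::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return combined;
    }

    public static void main(String[] args) {
        byte[] data = gzip("a\nb\na\n");
        System.out.println(readAndCombine(data)); // prints {a=2, b=1}
    }
}
```

The combining logic contains no compression-specific code at all, which is why choosing the right input format in the driver is the only step needed.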
