Why does a map-only job in Hive result in a single output file?
Problem description
When I execute the following query, I get only one output file, although the job runs 8 mappers and 0 reducers.
create table table_2 as select * from table_1;
8 mappers are invoked and there is no reduce phase, yet there is only one file in the location of table_2. Shouldn't there be 8 files, since we have 8 mappers and 0 reducers?
Recommended answer
From the Hive documentation, Configuration Properties...
hive.merge.mapfiles
Default Value: true
Merge small files at the end of a map-only job.
hive.merge.tezfiles
Default Value: false
Merge small files at the end of a Tez DAG.
hive.merge.smallfiles.avgsize
Default Value: 16000000
When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files...
So, if (a) your test dataset is very small and (b) you are not using Tez but plain old MapReduce, then by default Hive will run a post-map step just to merge the (intermediate) results. That merge step is why table_2 ends up with a single file even though 8 mappers ran.
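One way to check this explanation is to switch the merge off for the current session and rerun the statement. The following is a minimal sketch, assuming the MapReduce execution engine and the table names from the question. The DROP TABLE step is an assumption added so the statement can be rerun.

```sql
-- Print the current value of the setting (default is true)
SET hive.merge.mapfiles;

-- Disable the post-map merge for this session only
SET hive.merge.mapfiles=false;

-- Assumption: table_2 from the earlier run still exists and must be dropped
DROP TABLE IF EXISTS table_2;

-- Rerun the map-only CTAS from the question
CREATE TABLE table_2 AS SELECT * FROM table_1;

-- Show the table's Location so the output files can be listed there
DESCRIBE FORMATTED table_2;
```

With the merge disabled, each mapper that produces output should leave its own file in the table location. The count may still differ from 8 if some mappers write nothing.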
By contrast, this merge does not happen after a reduce step unless you set hive.merge.mapredfiles to true.
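For a job that does have a reduce phase, the same behaviour can be requested explicitly. This is a sketch only: table_3 and the column col_a are hypothetical names introduced for illustration, and the threshold shown is simply the documented default.

```sql
-- Ask Hive to merge small output files at the end of a map-reduce job
SET hive.merge.mapredfiles=true;

-- Merge is triggered when the average output file size is below this value (bytes)
SET hive.merge.smallfiles.avgsize=16000000;

-- Hypothetical query with a reduce phase (GROUP BY forces reducers)
CREATE TABLE table_3 AS
SELECT col_a, COUNT(*) AS cnt
FROM table_1
GROUP BY col_a;
```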