How to determine file size in HDFS using Hive

Problem description

The workspace I am using is set up with Hive 1.1.0 and CDH 5.5.4. I run a query whose result is spread over 22 partitions. Each of these partition directories always holds a single file, and its size can vary from 20 MB to 700 MB.

From what I understood, this is related to the number of reducers used while processing the query. Let's assume I want 5 files per partition instead of 1; I use this command:

set mapreduce.job.reduces=5;

This makes the system use 5 reduce tasks in stage 1, but it automatically switches to 1 reducer in stage 2 (determined automatically at compile time). From what I read, this is because the compiler takes precedence over the configuration when choosing the number of reducers: it seems some tasks cannot be parallelized and can only be done by a single process or reducer task, so the system determines this automatically.

Code:

insert into table core.pae_ind1 partition (project, ut, year, month)
select
  ts,
  date_time,
  -- m1: gap from a code 0 row to the next code 1 row (within the following 1000 rows), if that gap is <= 15
  if( code_ac_dcu_m1_d1 = 0
      and (min(case when code_ac_dcu_m1_d1 = 1 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1 = 1 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts,
      NULL) as t_open_dcu_m1_d1,
  -- gap from a code 2 row to the next code 3 row (within the following 1000 rows), if that gap is <= 15
  if( code_ac_dcu_m1_d1 = 2
      and (min(case when code_ac_dcu_m1_d1 = 3 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1 = 3 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts,
      NULL) as t_close_dcu_m1_d1,
  project, ut, year, month
from core.pae_open_close
where ut = '902'
order by ut, ts
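
As an aside, one way to look at the stage plan the compiler builds for this statement (the level at which the per-stage reducer behaviour mentioned above is decided) is to prefix it with explain; the snippet below is just the same statement in abbreviated form, not a new query:

explain
insert into table core.pae_ind1 partition (project, ut, year, month)
select ts, date_time,
       -- ... same t_open_dcu_m1_d1 / t_close_dcu_m1_d1 expressions as above ...
       project, ut, year, month
from core.pae_open_close
where ut = '902'
order by ut, ts;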

This leads to huge files at the end. I would like to know if there is a way to split these result files into smaller ones (preferably limiting them by size).

Answer

As @DuduMarkovitz pointed out, your code contains an instruction to order the dataset globally, and that will run on a single reducer. You are better off ordering when you select from the table later. Even if your files are in order after such an insert and they are splittable, they will be read by many mappers, and because of that parallelism the result will no longer be in order, so you will need to order anyway. Just get rid of this order by ut,ts in the insert and use these configuration settings to control the number of reducers:

set hive.exec.reducers.bytes.per.reducer=67108864;  
set hive.exec.reducers.max = 2000; --default 1009 

The number of reducers is determined according to:

mapred.reduce.tasks - The default number of reduce tasks per job, typically set to a prime number close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what the number of reducers should be.

hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.

Also hive.exec.reducers.max - the maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
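
Putting those three settings together, a minimal session setup might look like this (the 64 MB target is only the illustrative value from the snippet above, not a general recommendation):

set mapred.reduce.tasks=-1;                          -- let Hive derive the reducer count from data size
set hive.exec.reducers.bytes.per.reducer=67108864;   -- aim for roughly 64 MB of input per reducer
set hive.exec.reducers.max=2000;                     -- hard upper bound on the reducer count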

So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer. Each reducer will create one file per partition (not bigger than hive.exec.reducers.bytes.per.reducer). It is possible that one reducer will receive data for many partitions and as a result will create many small files in each partition, because during the shuffle phase a partition's data is distributed between many reducers.
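
To see what actually landed on disk, you can check the file sizes of a partition directory from the Hive CLI with the dfs command (the warehouse path and partition values below are assumptions; substitute your table's actual location):

-- sizes of all files in one partition directory (path and partition values are assumptions)
dfs -du -h /user/hive/warehouse/core.db/pae_ind1/project=p1/ut=902/year=2017/month=1/;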

If you do not want each reducer to create every (or too many) partitions, then distribute by the partition key (instead of ordering). In that case the number of files in a partition will be more like partition_size / hive.exec.reducers.bytes.per.reducer.
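
As a sketch of how that advice could be applied to the insert from the question (an assumption about the rewrite, not code from the original answer): the trailing order by ut,ts is replaced with distribute by on the partition keys, and the select list is otherwise unchanged:

insert into table core.pae_ind1 partition (project, ut, year, month)
select
  ts,
  date_time,
  if( code_ac_dcu_m1_d1 = 0
      and (min(case when code_ac_dcu_m1_d1 = 1 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1 = 1 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts,
      NULL) as t_open_dcu_m1_d1,
  if( code_ac_dcu_m1_d1 = 2
      and (min(case when code_ac_dcu_m1_d1 = 3 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts) <= 15,
      min(case when code_ac_dcu_m1_d1 = 3 then ts end)
           over (partition by ut order by ts rows between 1 following and 1000 following) - ts,
      NULL) as t_close_dcu_m1_d1,
  project, ut, year, month
from core.pae_open_close
where ut = '902'
distribute by project, ut, year, month;

Distributing by the partition keys sends all rows for a given partition to the same reducer, so a reducer no longer scatters small files across many partitions.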
