How to determine file size in HDFS using Hive


Problem description


The workspace I am using is set up with Hive 1.1.0 and CDH 5.5.4. I run a query whose result is written into 22 partitions. Each partition directory always contains a single file, whose size can vary from 20 MB to 700 MB.

From what I understand, this is related to the number of reducers used while processing the query. Let's assume I want to have 5 files for each partition instead of 1, so I use this command:

set mapreduce.job.reduces=5;

This makes the system use 5 reduce tasks in stage 1, but it automatically switches to 1 reducer in stage 2 (determined automatically at compile time). From what I have read, this is because the compiler takes precedence over the configuration when choosing the number of reducers. It seems that some tasks cannot be parallelized and can only be done by a single process or reducer task, so the system determines that automatically.
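In case it helps to pin down where the single-reducer stage comes from, Hive's EXPLAIN statement prints the stage plan without running the job. The statement below is only a minimal sketch: it uses a simplified select over the same table instead of the full insert shown under Code below, and the exact stage breakdown will depend on your Hive version.

-- Print the stage plan; the stage whose reduce side implements the final
-- ORDER BY is the one that ends up planned with a single reduce task.
EXPLAIN
select ts, date_time, project, ut, year, month
from core.pae_open_close
where ut = '902'
order by ut, ts;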

Code:

insert into table core.pae_ind1 partition (project, ut, year, month)
select
  ts,
  date_time,
  -- m1: if the current row is a code 0 event and the next code 1 event
  -- (within the following 1000 rows of the same ut) is at most 15 ts
  -- units away, return that gap, otherwise NULL
  if(code_ac_dcu_m1_d1 = 0
     and (min(case when code_ac_dcu_m1_d1 = 1 then ts end)
          over (partition by ut order by ts
                rows between 1 following and 1000 following) - ts) <= 15,
     min(case when code_ac_dcu_m1_d1 = 1 then ts end)
     over (partition by ut order by ts
           rows between 1 following and 1000 following) - ts,
     NULL) as t_open_dcu_m1_d1,
  -- same logic for the code 2 / code 3 pair
  if(code_ac_dcu_m1_d1 = 2
     and (min(case when code_ac_dcu_m1_d1 = 3 then ts end)
          over (partition by ut order by ts
                rows between 1 following and 1000 following) - ts) <= 15,
     min(case when code_ac_dcu_m1_d1 = 3 then ts end)
     over (partition by ut order by ts
           rows between 1 following and 1000 following) - ts,
     NULL) as t_close_dcu_m1_d1,
  project, ut, year, month
from core.pae_open_close
where ut = '902'
order by ut, ts

This leads to huge files at the end. I would like to know if there is a way of splitting these result files into smaller ones (preferably limiting them by size).

Solution

As @DuduMarkovitz pointed out, your code contains an instruction to order the dataset globally, and that will run on a single reducer. You are better off ordering when you later select from the table: even if the files are in order after such an insert and they are splittable, they will be read by many mappers, so the results will come back out of order due to parallelism and you will need to order them again anyway. Just get rid of the order by ut,ts in the insert and use these configuration settings to control the number of reducers:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer
set hive.exec.reducers.max=2000;                   -- default is 1009

The number of reducers is determined according to:

mapred.reduce.tasks - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.

hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.

Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.

So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer. Each reducer will create one file for each partition it receives (not bigger than hive.exec.reducers.bytes.per.reducer). It is possible that one reducer will receive data for many partitions and, as a result, will create many small files in each of those partitions, because during the shuffle phase each partition's data is distributed across many reducers.
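For instance (an illustrative calculation, not from the original answer): Hive estimates the reducer count as roughly total_input_bytes / hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max, so a stage reading about 10 GB with the 64 MB setting above would be planned with roughly 10240 / 64 = 160 reducers, well below the cap of 2000.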

If you do not want each reducer to write into every partition (or too many of them), then distribute by the partition key instead of ordering; a sketch of this rewrite follows. In this case the number of files in a partition will be closer to partition_size/hive.exec.reducers.bytes.per.reducer.
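As a sketch of that suggestion, here is the insert from the question with the global ORDER BY replaced by DISTRIBUTE BY on the partition keys, preceded by the settings quoted above (the values are the ones from this answer and are assumptions to tune for your data, not universally right numbers):

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer
set hive.exec.reducers.max=2000;

insert into table core.pae_ind1 partition (project, ut, year, month)
select
  ts,
  date_time,
  if(code_ac_dcu_m1_d1 = 0
     and (min(case when code_ac_dcu_m1_d1 = 1 then ts end)
          over (partition by ut order by ts
                rows between 1 following and 1000 following) - ts) <= 15,
     min(case when code_ac_dcu_m1_d1 = 1 then ts end)
     over (partition by ut order by ts
           rows between 1 following and 1000 following) - ts,
     NULL) as t_open_dcu_m1_d1,
  if(code_ac_dcu_m1_d1 = 2
     and (min(case when code_ac_dcu_m1_d1 = 3 then ts end)
          over (partition by ut order by ts
                rows between 1 following and 1000 following) - ts) <= 15,
     min(case when code_ac_dcu_m1_d1 = 3 then ts end)
     over (partition by ut order by ts
           rows between 1 following and 1000 following) - ts,
     NULL) as t_close_dcu_m1_d1,
  project, ut, year, month
from core.pae_open_close
where ut = '902'
-- no global ORDER BY here; spread the write stage across reducers while
-- keeping rows for the same target partition together
distribute by project, ut, year, month;

With the 64 MB value above, a partition whose data totals about 700 MB would then come out as roughly 700 / 64 ≈ 11 files instead of one, following the partition_size/hive.exec.reducers.bytes.per.reducer estimate.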
