Specify minimum number of generated files from Hive insert


Question

I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files happened to be influenced by the availability of nodes in the cluster, but it seems to be consistent.

So my questions are (a) what determines how many files are generated and (b) is there a way to specify the minimum number of files or (even better) the maximum size of each file?

Answer

The number of files generated during INSERT ... SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) and on the configured bytes per reducer.

If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case each reducer creates files in every partition. This puts high pressure on reducers and may cause an OOM exception.

To make sure each reducer writes files for only one partition, add DISTRIBUTE BY partition_column at the end of your query.
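
A minimal sketch of such a query, assuming a placeholder target table your_table and the source src and partition column part_col used in the full example further down:

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
DISTRIBUTE BY part_col; -- each reducer writes files for a single partition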

If the data volume is too big and you want more reducers to increase parallelism and create more files per partition, add a random number to the DISTRIBUTE BY, for example FLOOR(RAND()*100.0)%10 - it will additionally distribute the data into 10 random buckets, so each partition will contain 10 files.

Finally, your INSERT statement will look like this:

INSERT OVERWRITE TABLE your_table PARTITION(part_col) -- your_table is a placeholder target table
SELECT *
  FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; -- 10 files per partition

This configuration setting also affects the number of files generated:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer

If you have too much data, Hive will start more reducers, so that each reducer process handles no more than the specified bytes per reducer. The more reducers, the more files will be generated. Decreasing this setting increases the number of running reducers, and each reducer creates at least one file. For instance, with the 64 MB setting above and roughly 1 GB of data reaching the reduce stage, Hive would plan about 1073741824 / 67108864 = 16 reducers, hence at least 16 files. If the partition column is not in the DISTRIBUTE BY, then each reducer may create files in every partition.

In short, use:

DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10 -- 10 files per partition

If you want 20 files per partition, use FLOOR(RAND()*100.0)%20 - this guarantees a minimum of 20 files per partition if you have enough data, but does not bound the maximum size of each file.

The bytes-per-reducer setting does not guarantee a fixed minimum number of files; the number of files depends on total data size / bytes.per.reducer. What this setting does limit is the maximum size of each file.

But it is much better to use some evenly distributed key, or a combination of keys with low cardinality, instead of a random number, because if containers restart, rand() may produce different values for the same rows, which can cause data duplication or loss (data already present in some reducer's output will be distributed one more time to another reducer). You can compute a similar function on some available keys instead of rand() to get a more or less evenly distributed key with low cardinality.
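
For example, a deterministic alternative could hash an existing key instead of using rand(). A sketch assuming a hypothetical id column in src (PMOD keeps the bucket number non-negative):

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
DISTRIBUTE BY part_col, PMOD(HASH(id), 10); -- deterministic 10 buckets per partition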

You can use both methods combined: the bytes-per-reducer limit plus DISTRIBUTE BY, to control both the minimum number of files and the maximum file size.
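
Put together, a sketch of the combined approach, with the same placeholder names as above:

set hive.exec.reducers.bytes.per.reducer=67108864; -- cap each reducer at ~64 MB, bounding file size

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; -- at least 10 files per partition, given enough data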

Also read this answer about using distribute by to distribute data evenly between reducers: https://stackoverflow.com/a/38475807/2700344
