Specify minimum number of generated files from Hive insert

Question

I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files happened to be influenced by the availability of nodes in the cluster, but it seems consistent.

So my questions are: (a) what determines how many files are generated, and (b) is there a way to specify the minimum number of files or (even better) the maximum size of each file?

Answer

The number of files generated during INSERT ... SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez), plus the configured bytes per reducer.

If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case each reducer creates files in every partition. This puts high pressure on the reducers and may cause an OOM exception.

To make sure each reducer writes files for only one partition, add DISTRIBUTE BY partition_column at the end of your query.
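
For example, a minimal sketch, assuming a target table named your_table (a placeholder) partitioned by part_col:

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
DISTRIBUTE BY part_col; -- each reducer writes files for only one partition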

If the data volume is too big, and you want more reducers to increase parallelism and to create more files per partition, add a random number to the DISTRIBUTE BY, for example: FLOOR(RAND()*100.0)%10. This additionally distributes the data into 10 random buckets, so each partition will contain 10 files.

Finally, your INSERT statement will look like this:

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; -- 10 files per partition

This configuration setting also affects the number of files generated:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer

If you have too much data, Hive will start more reducers, so that each reducer process handles no more than the configured bytes per reducer. The more reducers, the more files are generated. Decreasing this setting may increase the number of reducers running, and each reducer will create at least one file. If the partition column is not in the DISTRIBUTE BY, then each reducer may create files in every partition.
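
As a rough illustration, a back-of-the-envelope sketch, assuming evenly distributed data and ignoring planner limits such as the maximum number of reducers:

-- total input ~ 640 MB, bytes per reducer = 64 MB
-- => about 640 / 64 = 10 reducers => about 10 output files
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB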

Long story short, use

DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10 -- 10 files per partition

If you want 20 files per partition, use FLOOR(RAND()*100.0)%20; this will guarantee at least 20 files per partition if you have enough data, but it will not guarantee the maximum size of each file.

The bytes-per-reducer setting alone does not guarantee a fixed minimum number of files; the number of files depends on total data size / bytes.per.reducer. What this setting does guarantee is the maximum size of each file.

But it is much better to use some evenly distributed key, or a combination of keys with low cardinality, instead of random, because if containers restart, rand() may produce different values for the same rows, and that can cause data duplication or loss (the same data already present in some reducer's output will be distributed one more time to another reducer). You can calculate a similar function on some available keys instead of rand() to get a more or less evenly distributed key with low cardinality.
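
For example, a sketch of a deterministic alternative; user_id is a hypothetical column, and hash() and pmod() are built-in Hive UDFs:

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
-- PMOD(HASH(...), 10) is deterministic per row, so a restarted reducer
-- task shuffles the same rows to the same bucket, unlike RAND()
DISTRIBUTE BY part_col, PMOD(HASH(user_id), 10); -- still ~10 files per partition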

You can use both methods combined: the bytes-per-reducer limit plus DISTRIBUTE BY, to control both the minimum number of files and the maximum file size.
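
Putting it together, a combined sketch using the same placeholder names as above:

set hive.exec.reducers.bytes.per.reducer=67108864; -- caps each file at ~64 MB

INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
  FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; -- at least 10 files per partition, given enough data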

Also read this answer about using DISTRIBUTE BY to distribute data evenly between reducers: https://stackoverflow.com/a/38475807/2700344
