Create Table in Hive with one file


Problem description

I'm creating a new table in Hive using:

CREATE TABLE new_table AS SELECT * FROM old_table;

My problem is that after the table is created, it generates multiple files for each partition, while I want only one file per partition.

How can I enforce this in the table definition? Thank you!

Recommended answer

There are many possible solutions:

1) Add distribute by partition key at the end of your query. There may be many partitions per reducer, and each reducer then creates a file for every partition it receives; distributing by the partition key sends all rows of a partition to a single reducer. This can reduce the number of files and memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process. See the sketch below.
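
A minimal sketch of this approach, assuming a target table partitioned by a hypothetical dt column (col1 and col2 are placeholder columns; Hive's CTAS cannot create a partitioned table directly, so this uses INSERT OVERWRITE with dynamic partitioning):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE new_table PARTITION (dt)
SELECT col1, col2, dt   -- the dynamic partition column must come last in the select list
FROM old_table
DISTRIBUTE BY dt;       -- all rows with the same dt go to one reducer, so each partition gets a single file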

2) Simple, and quite good if there is not too much data: add order by to force a single reducer. Or increase hive.exec.reducers.bytes.per.reducer=500000000; -- ~500M per reducer. This single-reducer solution suits small data volumes; it will run slowly if there is a lot of data.
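
For instance, a sketch of the order by variant (the sort column s is an assumption; any column works, since only the single-reducer side effect matters here):

CREATE TABLE new_table AS
SELECT * FROM old_table
ORDER BY s;  -- a global ORDER BY runs in one reducer, producing one output file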

If your task is map-only, it is better to consider options 3-5:

3) If running on MapReduce, switch on merge:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=500000000;      -- size of merged files at the end of the job
set hive.merge.smallfiles.avgsize=500000000; -- when the average output file size of a job is less than this number,
                                             -- Hive starts an additional map-reduce job to merge the output files into bigger files

4) When running on Tez:

set hive.merge.tezfiles=true; 
set hive.merge.size.per.task=500000000;
set hive.merge.smallfiles.avgsize=500000000;

5) For ORC files you can merge files efficiently using this command: ALTER TABLE T [PARTITION partition_spec] CONCATENATE;
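
For example, with a hypothetical ORC table sales partitioned by dt:

ALTER TABLE sales PARTITION (dt='2020-01-01') CONCATENATE;

For ORC, the merge happens at the stripe level, so the existing data is not decompressed and re-encoded, which is what makes this efficient.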
