配置单元句柄如何插入内部分区表中? [英] How does hive handle insert into internal partition table?

查看:88
本文介绍了配置单元句柄如何插入内部分区表中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要将记录流插入到Hive分区表中.表结构类似于

CREATE TABLE store_transation (
     item_name string,
     item_count int,
     bill_number int,
) PARTITIONED BY (
   yyyy_mm_dd string
);

我想了解Hive如何处理内部表中的插入.

是否将所有记录插入到文件内的单个文件中 yyyy_mm_dd = 2018_08_31 目录?还是配置单元在一个分区内拆分为多个文件?

如果每天有1百万条记录,并且查询模式介于日期范围之间,那么哪一项在以下各项中表现良好?

  1. 内部表中没有分区
  2. 按日期分区,每个日期只有一个文件
  3. 按日期分区,每个日期有多个文件

解决方案

在所有情况下,Insert都会执行相同的操作,因为insert不会查询您现有的数据(除非您使用自身的select进行插入),并且通常每个容器都会除非配置了文件合并,否则创建它自己的文件.

如果按日期对表进行分区,则使用日期范围查询数据会更好.文件过多可能会导致性能下降,因此您可能希望在插入过程中合并文件.什么是太多文件?每天每个分区可能有数百甚至数千个分区.每个分区只有很少的文件不会导致性能问题,您不需要合并它们.

Insert语句将在partitions目录中创建其他文件,并且通常不会与现有文件合并.将创建多少个文件取决于insert语句和配置设置.

正在运行的最终映射器或化简器的数量+配置设置将确定输出文件的数量.您可以通过添加"order by"子句来强制其在单个化简器上运行.在这种情况下,将为每个分区创建一个额外的文件,但是它将运行缓慢. 也可以使用distribute by partition key来减少创建的文件数量,但这将触发额外的reducer阶段,并且比仅map任务运行得更慢.

您还可以使用以下设置将新文件与现有文件合并:

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=128000000; -- (128MB)
SET hive.merge.smallfiles.avgsize=128000000; -- (128MB)

这些配置设置可能最终会触发合并任务(取决于上述设置中配置的大小),它将合并现有文件以及通过插入新添加的文件.

有关合并的更多详细信息,请参见以下答案: https://stackoverflow.com/a/45266244/2700344

实际上,在这种情况下,表的类型(托管表或外部表)无关紧要.插入或选择将起作用.

如果您已经具有与目标表相同格式的文件,那么最快的方法是将它们放置在分区目录中,而根本不使用DML查询.

对于ORC文件,您可以使用以下命令有效地合并文件:ALTER TABLE T [PARTITION partition_spec] CONCATENATE;

I have a requirement to insert streaming of records into Hive partitioned table. The table structure is something like

CREATE TABLE store_transation (
     item_name string,
     item_count int,
     bill_number int,
) PARTITIONED BY (
   yyyy_mm_dd string
);

I would like to understand how Hive handles inserts in the internal table.

Does all record insert into a single file inside the yyyy_mm_dd=2018_08_31 directory? Or hive splits into multiple files inside a partition, if so when?

Which one performs well from the following in case of per day 1 million records and the querying pattern is going to be between date range?

  1. No partition in the internal table
  2. Partition by date, each date has only one file
  3. Partition by date, each date has more than one file

解决方案

Insert will perform the same in all your cases because insert does not not query your existing data (unless you are inserting using select from itself) and normally each container will create it's own file unless file merge is configured.

Querying the data using date range will perform better if the table is partitioned by date. Too many files may cause performance degradation, so you may want to merge files during insert. What is too many files? Like hundreds or even thousands per daily partition. Having few files per partition will not cause performance problems, you do not need to merge them.

Insert statement will create additional file(s) in the partitions directory and normally will not merge with existing files. How many files will be created depends on the insert statement and configuration settings.

The number of final mappers or reducers running + configuration settings will determine the number of output files. You can force it running on single reducer by, for example, adding 'order by' clause. In such case one additional file per partition will be created but it will work slow. Also distribute by partition key can be used to reduce the number of files created, but this will trigger additional reducer stage and will work slower than map-only task.

Also you can merge new files with existing using these settings:

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=128000000; -- (128MB)
SET hive.merge.smallfiles.avgsize=128000000; -- (128MB)

These configuration settings may trigger merge task (depending on sizes configured in above settings) at the end and it will merge existing files as well as newly added by insert.

See this answer for more details about merge: https://stackoverflow.com/a/45266244/2700344

Actually the type of table, managed or external, does not matter in this context. Insert or select will work the same.

If you already have files in the same format as target table, then the fastest way is to put them in partitions directories without using DML query at all.

For ORC files you can merge files efficiently using this command: ALTER TABLE T [PARTITION partition_spec] CONCATENATE;

这篇关于配置单元句柄如何插入内部分区表中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆