HIVE - INSERT OVERWRITE vs DROP TABLE + CREATE TABLE + INSERT INTO


Question

I'm writing an automated script that runs a few Hive queries, and from time to time we need to clear the data from a table and insert new data. We're wondering which approach is faster:

INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

Or is it faster to do this:

DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

The overhead of running the extra queries is not an issue, since we already have the creation script. The question is: with billions of rows, is INSERT OVERWRITE faster than DROP + CREATE + INSERT INTO?

Recommended answer

For maximum speed, I would suggest the following: 1) First run hadoop fs -rm -r -skipTrash table_dir/* to remove the old data quickly without moving the files to Trash. INSERT OVERWRITE moves all of the old files to Trash, and for a very big table this takes a lot of time. 2) Then run the INSERT OVERWRITE command. This is also faster because you do not need to drop and re-create the table.
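The two steps above can be sketched as a small shell script. This is only an illustration under stated assumptions: the warehouse path, the `refresh_table` and `run` helper names, and the `DRY_RUN` switch are hypothetical and not part of the original answer, and the `-f` flag is added so that an already empty directory does not make `hadoop fs -rm` report an error. Keep in mind that files removed with `-skipTrash` cannot be restored from Trash.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN, and the example path are assumptions.
set -eu

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# refresh_table <table_dir> <target_table> <source_table>
refresh_table() {
  # 1) Delete the old data files directly, bypassing the HDFS Trash.
  #    The glob is quoted so that HDFS, not the local shell, expands it.
  run hadoop fs -rm -r -f -skipTrash "$1/*"
  # 2) Reload the table; the table definition itself is left untouched.
  run hive -e "INSERT OVERWRITE TABLE $2 SELECT * FROM $3;"
}

# Hypothetical location; check the real one with DESCRIBE FORMATTED SOME_TABLE.
refresh_table /user/hive/warehouse/some_table SOME_TABLE OTHER_TABLE
```

As written, the script only prints the two commands (without their shell quoting). Running it with `DRY_RUN=0` would execute them, provided the `hadoop` and `hive` clients are on the PATH.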

Update:

As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true"), the previous data of the table is not moved to Trash when an INSERT OVERWRITE query is run against it. This functionality applies only to managed tables. So INSERT OVERWRITE with auto purge will be faster than rm -skipTrash + INSERT OVERWRITE or DROP + CREATE + INSERT, because it is a single Hive-only command.
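A minimal sketch of the auto purge variant, assuming Hive 2.3.0 or later and a managed table. The `reload_with_auto_purge` and `run` helper names and the `DRY_RUN` switch are hypothetical. The property only needs to be set once per table; it is repeated here so the example is self-contained.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are assumptions, not from the answer.
set -eu

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# reload_with_auto_purge <target_table> <source_table>
reload_with_auto_purge() {
  # Mark the table so overwritten files skip the Trash, then reload it,
  # all within one hive invocation.
  run hive -e "ALTER TABLE $1 SET TBLPROPERTIES ('auto.purge'='true'); INSERT OVERWRITE TABLE $1 SELECT * FROM $2;"
}

reload_with_auto_purge SOME_TABLE OTHER_TABLE
```

Compared with the `hadoop fs -rm` approach, nothing here needs to know where the table's directory lives on HDFS.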

