在分区表上插入覆盖不会删除现有数据 [英] Insert overwrite on partitioned table is not deleting the existing data

查看:588
本文介绍了在分区表上插入覆盖不会删除现有数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在分区表上运行插入覆盖. 插入覆盖的选择查询完全省略了一个分区.这是预期的行为吗?

I am trying to run insert overwrite over a partitioned table. The select query of insert overwrite omits one partition completely. Is it the expected behavior?

表定义

CREATE TABLE `cities_red`(                                                              
  `cityid` int,                                                                         
  `city` string)                                                                        
PARTITIONED BY (                                                                        
  `state` string)                                                                       
ROW FORMAT SERDE                                                                        
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                           
STORED AS INPUTFORMAT                                                                   
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                                     
OUTPUTFORMAT                                                                            
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                                    
TBLPROPERTIES (                                                                         
  'auto.purge'='true',                                                                  
  'last_modified_time'='1555591782',                                                    
  'transient_lastDdlTime'='1555591782');  

表格数据

+--------------------+------------------+-------------------+--+
| cities_red.cityid  | cities_red.city  | cities_red.state  |
+--------------------+------------------+-------------------+--+
| 13                 | KARNAL           | HARYANA           |
| 13                 | KARNAL           | HARYANA           |
| 1                  | Nagpur           | MH                |
| 22                 | Mumbai           | MH                |
| 22                 | Mumbai           | MH                |
| 755                | BPL              | MP                |
| 755                | BPL              | MP                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 12                 | NOIDA            | UP                |
| 12                 | NOIDA            | UP                |
+--------------------+------------------+-------------------+--+

查询

insert overwrite table cities_red partition (state) select * from cities_red where city !='NOIDA';

它不会从表中删除任何数据

It does not delete any data from the table

insert overwrite table cities_red partition (state) select * from cities_red where city !='Mumbai';

它将从表中删除预期的2行.

It removes the expected 2 rows from the table.

在分区表的情况下,这是Hive的预期行为吗?

推荐答案

是的,这是预期的行为.

Yes, this is expected behavior.

Insert overwrite table partition select ,,,仅覆盖由select返回的数据集中现有的分区.

Insert overwrite table partition select ,,, overwrites only partitions existing in the dataset returned by select.

在示例分区state=UP中仅具有city='NOIDA'记录.过滤器where city !='NOIDA'从返回的数据集中删除整个state=UP分区,这就是为什么不重写它的原因.

In your example partition state=UP has records with city='NOIDA' only. Filter where city !='NOIDA' removes entire state=UP partition from the returned dataset and this is why it is not being rewritten.

过滤器city !='Mumbai'不会过滤整个分区,它会部分返回,这就是为什么它被过滤后的数据覆盖的原因.

Filter city !='Mumbai' does not filter entire partition, it is partially returned, this is why it is being overwritten with filtered data.

它按设计工作.考虑只需要覆盖所需分区的情况,这对于增量分区负载来说是很正常的.在这种情况下,您无需触摸其他分区. 您通常需要能够仅覆盖所需的分区.而且如果不覆盖不变的分区,则恢复起来可能会非常昂贵.

It works as designed. Consider scenario when you need to overwrite only desired partitions, this is quite normal for the incremental partition load. You do not need to touch other partitions in this case. You need to be able normally to overwrite only desired partitions. And without overwriting unchanged partitions, which can be very expensive to recover.

如果仍然要删除分区并修改现有分区中的数据,则可以删除/创建表(为此可能需要再创建一个中间表),然后将分区加载到其中. 或者,计算需要单独删除并执行ALTER TABLE DROP PARTITION的分区.

And if you still want to drop partitions and modify data in existing partitions, then you can drop/create table (you may need to create one more intermediate table for this) and then load partitions into it. Or alternatively calculate partitions which you need to drop separately and execute ALTER TABLE DROP PARTITION.

这篇关于在分区表上插入覆盖不会删除现有数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆