如何合并蜂巢中现有分区的小文件? [英] How to merge small files from existing partitions in hive?

查看:105
本文介绍了如何合并蜂巢中现有分区的小文件?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



例如,我有一个表user1,它包含列fname,lname和分区列是天。



我使用下面的脚本创建了表

CREATE TABLE user1(fname string,在将数据插入分区表后,它将如下所示。



pre> fname lname天
.....................
AA AAA 20170201 .... >分区20170201
BB BBB 20170201
...................
CC CCC 20170202 ......>分区20170202
DD DDD 20170202
....................
EE EEE 20170203 .......>分区20170203
FF FFF 20170203
.......................
GG GGG 20170204 ........> ;分区20170204
HH HHH 20170204
.......................

当我使用partition列(即day = 20170201)帮助执行select查询时。

  select * from user1 where day = 20170201; 

结果如下

  AA AAA 20170201 
BB BBB 20170201

基于上表我想合并所有的小文件,即日= 20170201和日= 20170202和日= 20170203分区日= 20170203在我的分区表(即USer1).ie它应该如下所示。

  fname lname日
............. ........
AA AAA 20170201
BB BBB 20170201
CC CCC 20170202
DD DDD 20170202
E EEE 20170203 ....... >分区20170203
FF FFF 20170203
.......................
GG GGG 20170204 ..... ...>分区20170204
HH HHH 20170204
.......................

可以请您就此提出建议,我该如何做到这一点?

提前。

解决方案


  1. 创建由新字段 partition_day

  2. $ b


      CREATE TABLE user_new(fname string, lname string,day int)parittioned By(partition_day int); 





    1. 加载数据导入新表(在情况情况下为新分区定义条件)
      $ b


       插入覆盖表user_new分区(partition_day)
      选择fname,lname,日期,
      当天日期<= 20170203然后20170203 $ b时$ b当天> 20170203然后20170204
      从user1结束为partition_day
      ;



      How to merge existing Partition small files into one large file in one of the Partition .

      For example I have a table user1, it contain columns fname,lname and partition column is day.

      I have created table by using below script

      CREATE TABLE user1(fname string,lname string) parittioned By (day int);

      After inserting data into partion table it will look like below.

       fname  lname  day
      .....................
      AA      AAA   20170201     ....>partition 20170201
      BB      BBB   20170201
      ...................
      CC      CCC   20170202    ......>partition 20170202
      DD      DDD   20170202
      ....................
      EE      EEE   20170203    .......>partition 20170203
      FF      FFF   20170203
      .......................
      GG      GGG   20170204    ........>partition 20170204         
      HH      HHH   20170204
      .......................
      

      When I execute select query with the help of partition column i.e. day=20170201.

      select * from user1 where day=20170201;
      

      It will give result like below

      AA      AAA   20170201
      BB      BBB   20170201
      

      based on above table i want to merge the all small files i.e day =20170201 and day =20170202 and day=20170203 into partition day=20170203 in my partition table (i.e USer1).i.e. It should look like below.

      fname  lname  day
      .....................
      AA      AAA   20170201
      BB      BBB   20170201
      CC      CCC   20170202    
      DD      DDD   20170202
      E       EEE   20170203    .......>partition 20170203
      FF      FFF   20170203
      .......................
      GG      GGG   20170204    ........>partition 20170204         
      HH      HHH   20170204
      .......................
      

      can you please suggest on this,How can I achieve this?

      Thanks in Advance.

      解决方案

      1. Create new table partitioned by new field partition_day:

      CREATE TABLE user_new(fname string,lname string, day int) parittioned By (partition_day int);
      

      1. Load data into new table (define your conditions for new partitionsin the case )

         insert overwrite table user_new partition (partition_day)
          select fname,lname, day,
                 case when day <= 20170203 then 20170203
                      when day >  20170203 then 20170204
                 end as partition_day
            from user1 ;
      

      这篇关于如何合并蜂巢中现有分区的小文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆