Partition Hive table by existing field?


Problem description

Can I partition a Hive table upon insert by an existing field?

I have a 10 GB file with a date field and an hour of day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as a partition? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event  PARTITION(dt=evt.datestring,hour=evt.hour) 
SELECT * FROM staging_event evt;

Thanks!

Travis

Solution

I just ran across this trying to answer the same question and it was helpful but not quite complete. The short answer is yes, something like the query in the question will work but the syntax is not quite right.

Say you have three tables which were created using the following statements:

CREATE TABLE staging_unpartitioned (datestring string, hour int, a int, b int);

CREATE TABLE staging_partitioned (a int, b int) 
    PARTITIONED BY (datestring string, hour int);

CREATE TABLE production_partitioned (a int, b int) 
    PARTITIONED BY (dt string, hour int);

Columns a and b are just some example columns. dt and hour are the values we want to partition on once the data reaches the production table. The INSERT that moves staging data to production looks exactly the same whether the source is staging_unpartitioned or staging_partitioned.

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_partitioned;

This uses a process called dynamic partitioning. The important thing to note is that the mapping of columns to partitions is determined by the SELECT order, not by column names. All dynamic partition columns must be selected last, in the same order as they appear in the PARTITION clause.
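To make the positional rule concrete, here is a small sketch using the tables defined above. The first statement shows that an alias is only cosmetic: the last two SELECT expressions feed dt and hour because of where they sit in the list. The second statement is one possible rewrite of the query from the question. The columns event_id and payload are hypothetical, since the question never lists the columns of tealeaf_event, so treat them as placeholders.

```sql
-- Partition columns are matched by position, not by name.
-- The last two SELECT expressions feed (dt, hour), in that order;
-- the alias "AS dt" is for readability only.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
SELECT a, b, datestring AS dt, hour
FROM staging_unpartitioned;

-- Hypothetical rewrite of the query from the question.
-- event_id and payload are made-up column names: replace them with the
-- real non-partition columns of tealeaf_event, in table order.
INSERT OVERWRITE TABLE tealeaf_event PARTITION (dt, hour)
SELECT evt.event_id, evt.payload, evt.datestring, evt.hour
FROM staging_event evt;
```

Swapping the last two expressions (hour before datestring) would still be accepted by the parser, but the hour values would be used for dt and the date strings for hour, so it is worth double-checking the order before overwriting a production table.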

There's a good chance that when you run the code above you will hit an error caused by your current Hive settings. First, it will not work if dynamic partitioning is disabled, so make sure to set:

set hive.exec.dynamic.partition=true;

Then you might hit an error if the dynamic partition mode is strict, which requires at least one static partition ahead of the dynamic ones. This restriction is meant to stop you from accidentally overwriting a root partition when you only meant to overwrite its sub-partitions with dynamic partitions. In my experience this behavior has never been helpful and has often been annoying, but your mileage may vary. At any rate, it is easy to change:

set hive.exec.dynamic.partition.mode=nonstrict;

And that should do it.
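If you would rather leave strict mode on, an alternative is to fix the first partition statically and let only hour be dynamic. This is a minimal sketch against the example tables above; the date literal '2012-01-01' is an assumption chosen for illustration, and you would run one such statement per date.

```sql
-- Static dt, dynamic hour: the static value goes in the PARTITION clause
-- and is NOT repeated in the SELECT list. '2012-01-01' is a made-up date.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt = '2012-01-01', hour)
SELECT a, b, hour
FROM staging_unpartitioned
WHERE datestring = '2012-01-01';

-- List the partitions that now exist, to check the result.
SHOW PARTITIONS production_partitioned;
```

Static partition values have to come before dynamic ones in the PARTITION clause, which is why dt is the one pinned here.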
