Hive loading in partitioned table


Question

I have a log file in HDFS, values are delimited by comma. For example:

2012-10-11 12:00,opened_browser,userid111,deviceid222

Now I want to load this file into a Hive table which has columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last two columns in the log file as the partition for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
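For reference, a table matching the description above might be declared like this; the table name `events` and the column name `ts` are assumed for illustration (`timestamp` is a reserved word in newer Hive versions):

```sql
-- Target table: two data columns, partitioned by userid and deviceid.
-- Partition columns are declared separately from the data columns.
CREATE TABLE events (
  ts     STRING,
  action STRING
)
PARTITIONED BY (userid STRING, deviceid STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```

With a table like this, a plain `LOAD DATA ... PARTITION (...)` requires literal partition values, which is exactly the limitation the question runs into.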

One solution is to create an intermediate non-partitioned table with all four columns, populate it from the file, and then do an `INSERT INTO first_table PARTITION (userid, deviceid) SELECT timestamp, action, userid, deviceid FROM intermediate_table;`. But that is an additional task, and we would have two very similar tables. Or should we create an external table as the intermediate?

Answer

Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.

The quick context is:

  1. LOAD DATA just copies the data; it does not read it, so it cannot determine what to partition by.
  2. It is suggested that you load the data into an intermediate table first (or use an external table pointing at all the files), and then let dynamic-partition insert kick in to load it into the partitioned table.
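The two steps above can be sketched in HiveQL as follows; the table names `events` and `events_staging` and the HDFS path are assumptions, and `hive.exec.dynamic.partition` / `hive.exec.dynamic.partition.mode` are the standard Hive settings that enable dynamic-partition inserts:

```sql
-- Step 1: external staging table over the raw log directory,
-- carrying the partition values as ordinary columns.
CREATE EXTERNAL TABLE events_staging (
  ts       STRING,
  action   STRING,
  userid   STRING,
  deviceid STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/myname/logs';

-- Step 2: enable dynamic partitioning, then let Hive derive the
-- partition values from the trailing columns of the SELECT.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE events PARTITION (userid, deviceid)
SELECT ts, action, userid, deviceid
FROM events_staging;
```

Because the staging table is external, dropping it later removes only the metadata, not the log files themselves.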
