Hive loading in partitioned table
Question
I have a log file in HDFS whose values are delimited by commas. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table that has columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last 2 columns in the log file as the partition for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');",
require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all 4 columns, populate it from the file, and then do an INSERT into first_table PARTITION (userid, deviceid) SELECT timestamp, action, userid, deviceid FROM intermediate_table;
but that is an additional task, and we would end up with 2 very similar tables. Or should we create an external table as the intermediate?
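The intermediate-table approach described above can be sketched in HiveQL roughly as follows. The table names (`logs_staging`, `logs`), the column name `ts` (used instead of `timestamp`, which is a Hive keyword), and the file path are hypothetical, chosen only to match the log line shown earlier:

```sql
-- Staging table matching the raw comma-delimited log layout
-- (hypothetical names for illustration):
CREATE TABLE logs_staging (
  ts       STRING,
  action   STRING,
  userid   STRING,
  deviceid STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- LOAD DATA only moves the file into the table's directory:
LOAD DATA INPATH '/user/myname/logs.txt' INTO TABLE logs_staging;

-- Target table: only 2 data columns, partitioned by the other 2:
CREATE TABLE logs (
  ts     STRING,
  action STRING
)
PARTITIONED BY (userid STRING, deviceid STRING);

-- Enable dynamic partitioning, then insert; the partition columns
-- must come last in the SELECT list, in partition order:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE logs PARTITION (userid, deviceid)
SELECT ts, action, userid, deviceid FROM logs_staging;
```

With `nonstrict` mode, Hive derives every partition value from the data itself, so no partition needs to be spelled out in the script.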
Answer
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is:
- LOAD DATA just copies the data; it does not read it, so it cannot figure out what to partition by
- The suggestion is to load the data into an intermediate table first (or use an external table pointing to all the files), and then let dynamic-partition insert kick in to load it into the partitioned table
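The external-table variant mentioned in the second point avoids even the extra copy that LOAD DATA would make, since the table just points at the existing files. A minimal sketch, with hypothetical names and a hypothetical HDFS directory:

```sql
-- External table over the existing log directory; dropping it
-- later does not delete the underlying files:
CREATE EXTERNAL TABLE logs_ext (
  ts       STRING,
  action   STRING,
  userid   STRING,
  deviceid STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/myname/logs/';

-- Dynamic-partition insert into the partitioned target table
-- (assumed to exist as logs, partitioned by userid, deviceid):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE logs PARTITION (userid, deviceid)
SELECT ts, action, userid, deviceid FROM logs_ext;
```

This keeps only one "real" managed table; the external table is just a typed view over the raw files.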