Table Partitioned by Timestamp Field
Question
In order to generate some summary figures, we are importing data periodically into Hive. We currently use a CSV file format with the following layout:
operation,item,timestamp,user,marketingkey
Currently we have a few queries that group by the date (yyyy-MM-dd) derived from the timestamp field.
The imported files sometimes span several days, and I would like to store the data in a partitioned way. Is there a way to do this with Hive? I have built the table based on the following DDL:
CREATE TABLE
partitionedTable (name string)
PARTITIONED BY (time bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
The data is loaded as follows:
LOAD DATA LOCAL INPATH
'/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE partitionedTable PARTITION(time='2013-05-01');
But I would like Hive to apply the partitioning automatically, based on a field that comes within the file being imported. For example:
login,1,1370793184,user1,none --> stored to partition 2013-06-09
login,2,1360793184,user1,none --> stored to partition 2013-02-13
login,1,1360571184,user2,none --> stored to partition 2013-02-11
buy,2,1360501184,user2,key1 --> stored to partition 2013-02-10
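For reference, the expected partition for each row is just the calendar date of its Unix timestamp. A minimal Python sketch (assuming the timestamps are interpreted in UTC; Hive's from_unixtime actually uses the session time zone) that reproduces the mapping above:

```python
from datetime import datetime, timezone

def partition_for(ts):
    """Derive the yyyy-MM-dd partition value from a Unix timestamp (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

for ts in (1370793184, 1360793184, 1360571184, 1360501184):
    print(ts, "->", partition_for(ts))
```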
Answer
It seems like you are looking for dynamic partitioning, and Hive supports dynamic partition inserts (see the Hive documentation on dynamic partitions for details).
First, you need to create a temporary table to hold your flat data, with no partitions at all. In your case this would be:
CREATE TABLE
flatTable (type string, id int, ts bigint, user string, key string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Then, you should load your flat data file into this table:
LOAD DATA LOCAL INPATH
'/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE flatTable;
At that point you can use a dynamic partition insert. A few things to keep in mind: you will need the following properties set:
- hive.exec.dynamic.partition should be set to true, because dynamic partitioning is disabled by default, I believe.
- hive.exec.dynamic.partition.mode should be set to nonstrict, because you have a single partition and strict mode enforces that at least one partition be static.
So you can run the following query:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM
flatTable
INSERT OVERWRITE TABLE
partitionedTable
PARTITION(time)
SELECT
user, from_unixtime(ts, 'yyyy-MM-dd') AS time;
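Conceptually, the dynamic partition insert buckets each row by the value computed for the partition column. A rough Python sketch of that behavior (an in-memory stand-in for illustration only, not how Hive is implemented; timestamps interpreted in UTC):

```python
from collections import defaultdict
from datetime import datetime, timezone

# Flat rows as they would sit in flatTable: (type, id, ts, user, key)
rows = [
    ("login", 1, 1370793184, "user1", "none"),
    ("login", 2, 1360793184, "user1", "none"),
    ("login", 1, 1360571184, "user2", "none"),
    ("buy",   2, 1360501184, "user2", "key1"),
]

# Group rows into partitions keyed by the derived date, the way
# PARTITION(time) ... from_unixtime(ts, 'yyyy-MM-dd') would.
partitions = defaultdict(list)
for _type, _id, ts, user, _key in rows:
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    partitions[day].append(user)

for day in sorted(partitions):
    print(f"Loading partition {{time={day}}}")
```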
This should spawn two MapReduce jobs, and at the end you should see something along the lines of:
Loading data to table default.partitionedtable partition (time=null)
Loading partition {time=2013-02-10}
Loading partition {time=2013-02-11}
Loading partition {time=2013-02-13}
Loading partition {time=2013-06-09}
And to verify that your partitions are indeed there:
$ hadoop fs -ls /user/hive/warehouse/partitionedTable/
Found 4 items
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-10
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-11
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-13
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-06-09
Please note that dynamic partitions are only supported since Hive 0.6, so if you have an older version this is probably not going to work.