Table Partitioned by Timestamp Field

Problem Description

In order to generate some summary figures we are importing data periodically to Hive. We are currently using a CSV file format and its layout is as follows:

operation,item,timestamp,user,marketingkey

Currently we have a few queries that group by the date (yyyy-mm-dd) derived from the timestamp field.
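
For context, here is a sketch of the kind of grouping query this implies, assuming the raw rows sit in a staging table with the CSV columns above (the table name rawEvents is illustrative, and the timestamp column is backtick-quoted to avoid clashing with Hive's TIMESTAMP keyword):

SELECT
    from_unixtime(`timestamp`, 'yyyy-MM-dd') AS day,
    COUNT(*) AS events
FROM rawEvents
GROUP BY from_unixtime(`timestamp`, 'yyyy-MM-dd');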

The files being imported sometimes span several days, and I would like to store the data in a partitioned way. Is there a way to do this with Hive? I have built the table based on the following DDL:

CREATE TABLE 
   partitionedTable (name string) 
PARTITIONED BY (time bigint) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The data is loaded as follows:

LOAD DATA LOCAL INPATH 
   '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv' 
INTO TABLE partitionedTable PARTITION(time='2013-05-01');

But I would like Hive to apply the partitioning automatically, based on a field that comes within the file being imported. For example:

login,1,1370793184,user1,none --> stored to partition 2013-06-09
login,2,1360793184,user1,none --> stored to partition 2013-02-13
login,1,1360571184,user2,none --> stored to partition 2013-02-11
buy,2,1360501184,user2,key1   --> stored to partition 2013-02-10
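
These target dates are just the Unix epoch timestamps rendered as yyyy-MM-dd; a quick way to check one value from the rows above (the exact date returned by from_unixtime depends on the session timezone):

SELECT from_unixtime(1370793184, 'yyyy-MM-dd');   -- expected: 2013-06-09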

Recommended Answer

It seems like you are looking for dynamic partitioning, and Hive supports dynamic partition inserts as detailed in this article.

First, you need to create a temporary table where you will put your flat data with no partition at all. In your case this would be:

CREATE TABLE 
    flatTable (type string, id int, ts bigint, user string, key string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then, you should load your flat data file into this table:

LOAD DATA LOCAL INPATH
    '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE flatTable;
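
Optionally, a quick sanity check that the rows landed in the staging table (a minimal sketch):

SELECT * FROM flatTable LIMIT 5;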

At that point you can use the dynamic partition insert. A few things to keep in mind are that you'll need the following properties:

  • hive.exec.dynamic.partition should be set to true, because dynamic partitioning is disabled by default, I believe.
  • hive.exec.dynamic.partition.mode should be set to nonstrict, because you only have one partition column and strict mode requires at least one static partition.

So you can run the following query:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM
    flatTable
INSERT OVERWRITE TABLE
    partitionedTable
PARTITION(time)
SELECT
    user, from_unixtime(ts, 'yyyy-MM-dd') AS time;

This should spawn 2 MapReduce jobs, and at the end you should see something along the lines of:

Loading data to table default.partitionedtable partition (time=null)
    Loading partition {time=2013-02-10}
    Loading partition {time=2013-02-11}
    Loading partition {time=2013-02-13}
    Loading partition {time=2013-06-09}

And to verify that your partitions are indeed here:

$ hadoop fs -ls /user/hive/warehouse/partitionedTable/
Found 4 items
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-10
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-11
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-13
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-06-09
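
Alternatively, the same partitions can be listed from inside Hive with a standard command:

SHOW PARTITIONS partitionedTable;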

Please note that dynamic partitions are only supported since Hive 0.6, so if you have an older version this is probably not going to work.
