通过确切的插入时间戳在Hive中进行动态分区 [英] Dynamic partitioning in Hive through the exact inserted timestamp

查看:68
本文介绍了通过确切的插入时间戳在Hive中进行动态分区的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要将数据插入给定的外部表,该表应按插入的日期进行分区.我的问题是Hive如何处理时间戳生成?当我为所有插入的记录选择时间戳时,如下所示:

I need to insert data to a given external table which should be partitioned by the inserted date. My question is how is Hive handling the timestamp generation? When I select a timestamp for all inserted records like this:

WITH delta_insert AS (
        SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp
        FROM target_table trg
        )
SELECT *
FROM delta_insert;

即使查询要花费很多时间才能使所有记录的时间戳始终相同?

Will the timestamp always be identical for all records, even if the query takes a lot of time to un?

还是我应该只选择一个实际的时间戳,然后将其与之后选择的所有内容一起加入?

Or should I alternatively only select an actual timestamp and join it with everything that is selected afterwards?

WITH insert_timestamp AS (
    SELECT from_unixtime(unix_timestamp()) AS insert_timestamp
),
delta_insert AS (
        SELECT trg.*, insert_timestamp.insert_timestamp
        FROM target_table trg, insert_timestamp
        )
SELECT *
FROM delta_insert;

由于不建议在Hive中进行交叉联接,所以我想知道哪种方法最好,因为我没有足够的测试数据来模拟长时间运行的查询,以确保将它们插入到增量的同一分区中加载.

Since it's not recommended to do cross joins in Hive, I wonder what would be the best approach, since I don't have enough test data to simulate long running queries to be sure they get inserted within the same partition of a delta load.

推荐答案

否,使用unix_timestamp()生成的时间戳记对于所有记录都不总是相同的.此函数是不确定性的,会阻止查询的适当优化-自2.0版以来已弃用此函数,而推荐使用CURRENT_TIMESTAMP常量(对于查询仅计算一次).查看此文档:日期函数这不适用于带有参数的 unix_timestamp(字符串日期). unix_timestamp(字符串日期)函数是确定性的.

No, the timestamp generated using unix_timestamp() WILL NOT always be identical for all records. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant (calculated only one time for the query). See this docs: Date functions This not applies to unix_timestamp(string date) with argument. unix_timestamp(string date) function is deterministic.

current_timestamp 返回查询评估开始时的当前时间戳(自Hive 1.2.0起).同一查询中对current_timestamp的所有调用均返回相同的值.如果需要日期,请使用 current_date 函数.

current_timestamp returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. If you need date, use current_date function.

这篇关于通过确切的插入时间戳在Hive中进行动态分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆