通过确切插入的时间戳在 Hive 中进行动态分区 [英] Dynamic partitioning in Hive through the exact inserted timestamp
问题描述
我需要将数据插入给定的外部表,该表应按插入日期进行分区.我的问题是 Hive 如何处理时间戳生成?当我为所有插入的记录选择时间戳时:
I need to insert data to a given external table which should be partitioned by the inserted date. My question is how is Hive handling the timestamp generation? When I select a timestamp for all inserted records like this:
WITH delta_insert AS (
SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp
FROM target_table trg
)
SELECT *
FROM delta_insert;
所有记录的时间戳是否总是相同的,即使查询需要很多时间来取消?
Will the timestamp always be identical for all records, even if the query takes a lot of time to un?
或者我应该只选择一个实际的时间戳并将其与之后选择的所有内容结合起来?
Or should I alternatively only select an actual timestamp and join it with everything that is selected afterwards?
WITH insert_timestamp AS (
SELECT from_unixtime(unix_timestamp()) AS insert_timestamp
),
delta_insert AS (
SELECT trg.*, insert_timestamp.insert_timestamp
FROM target_table trg, insert_timestamp
)
SELECT *
FROM delta_insert;
由于不建议在 Hive 中进行交叉联接,我想知道最好的方法是什么,因为我没有足够的测试数据来模拟长时间运行的查询以确保它们插入到增量的同一分区中加载.
Since it's not recommended to do cross joins in Hive, I wonder what would be the best approach, since I don't have enough test data to simulate long running queries to be sure they get inserted within the same partition of a delta load.
推荐答案
不,使用 unix_timestamp() 生成的时间戳对于所有记录不会总是相同.此函数是非确定性的,会阻止对查询进行适当的优化 - 自 2.0 以来已弃用此功能,以支持 CURRENT_TIMESTAMP 常量(仅为查询计算一次).请参阅此文档:日期函数这不适用于带参数的 unix_timestamp(string date)
.unix_timestamp(string date)
函数是确定性的.
No, the timestamp generated using unix_timestamp() WILL NOT always be identical for all records. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant (calculated only one time for the query). See this docs: Date functions
This not applies to unix_timestamp(string date)
with argument. unix_timestamp(string date)
function is deterministic.
current_timestamp
返回查询评估开始时的当前时间戳(从 Hive 1.2.0 开始).在同一查询中对 current_timestamp 的所有调用都返回相同的值.如果您需要日期,请使用 current_date
函数.
current_timestamp
returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.
If you need date, use current_date
function.
这篇关于通过确切插入的时间戳在 Hive 中进行动态分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!