Hive分区修剪计算列 [英] Hive partition pruning on computed column
问题描述
我在Hive上有几张表,我的查询正在尝试检索过去x天的数据。
选择Hive在使用直接日期时修剪分区,但在使用公式时进行全表扫描。 *
from f_event
where date_key> 20160101;
扫描分区..
s3://...key=20160102 [f]
s3://...key=20160103 [f]
s3://...key=20160104 [f]
如果我使用公式,例如,获取过去4周的数据
选择次数(*)
从f_event f
where date_key> from_unixtime(unix_timestamp() - 2 * 7 * 60 * 60 * 24,'yyyyMMdd')
这是扫描表中的所有分区。
环境:Hadoop 2.6.0,EMR,S3上的Hive,Hive 1.0.0
unix_timestamp()<>)时,Hive不会触发分区修剪。 / code>。
一个很好的理由在讨论:
想象一下, :
WHERE partition_column = f(unix_timestamp())AND ordinary_column =
f(unix_timestamp)
/ p>
谓词的右侧必须在map-time,
被评估,而你假设左手侧应该被评估为
编译时间,这意味着y你有两个不同的
unix_timestamp()浮动值,它只能很糟糕地结束。
I have a few tables on Hive and my query is trying to retrieve the data for the past x days. Hive is pruning the partitions when I use a direct date, but is doing a full table scan when using a formula instead.
select *
from f_event
where date_key > 20160101;
scanned partitions..
s3://...key=20160102 [f]
s3://...key=20160103 [f]
s3://...key=20160104 [f]
If I use a formula, say, to get the past 4 weeks of data
Select count(*)
From f_event f
Where date_key > from_unixtime(unix_timestamp()-2*7*60*60*24, 'yyyyMMdd')
This is scanning all partitions in the table.
environment : Hadoop 2.6.0, EMR, Hive on S3, Hive 1.0.0
解决方案 Hive doesn't trigger partition pruning when the filtering expression contains non-deterministic functions such as unix_timestamp()
.
A good reason for this was mentioned in the discussion:
Imagine a situation where you had:
WHERE partition_column = f(unix_timestamp()) AND ordinary_column =
f(unix_timestamp)
.
The right hand side of the predicate has to be evaluated at map-time,
whereas you're assuming that left hand side should be evaluated at
compile time, which means you have two different values of
unix_timestamp() floating around, which can only end badly.
这篇关于Hive分区修剪计算列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!