Hive query performance is slow when using Hive date functions instead of hardcoded date strings?
Problem description
I have a transaction table table_A that gets updated every day. Every day I insert new data into table_A from external table_B, using the file_date field to filter the necessary data from external table_B to insert into table_A. However, there's a huge performance difference if I use a hardcoded date vs. using the Hive date functions:
-- Fast version (~20 minutes)
SET date_ingest = '2016-12-07';
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;
INSERT INTO TABLE table_A PARTITION (FILE_DATE)
SELECT
    id,
    eventtime,
    CONCAT_WS('-', SUBSTR(eventtime, 0, 4), SUBSTRING(eventtime, 5, 2), SUBSTRING(eventtime, 7, 2))
FROM table_B
WHERE file_date = ${hiveconf:date_ingest};
compared to:
-- Slow version (~9 hours)
SET date_ingest = date_add(to_date(from_unixtime(unix_timestamp())), -1);
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;
INSERT INTO TABLE table_A PARTITION (FILE_DATE)
SELECT
    id,
    eventtime,
    CONCAT_WS('-', SUBSTR(eventtime, 0, 4), SUBSTRING(eventtime, 5, 2), SUBSTRING(eventtime, 7, 2))
FROM table_B
WHERE file_date = ${hiveconf:date_ingest};
Has anyone experienced similar issues? You should assume that I don't have access to the Unix hive command (i.e. can't use --hiveconf options) since we're using a third party UI.
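Partition pruning sometimes does not work when the filter clause uses functions. In the slow version, SET stores the expression text rather than its result, so after substitution the WHERE clause compares file_date against a function call instead of a constant. If you calculate the variable in a wrapper shell script and pass it to Hive as a -hiveconf variable, it will work fine. Example: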
#inside shell script
date_ingest=$(date -d '-1 day' +%Y-%m-%d)
hive -f your_script.hql -hiveconf date_ingest="$date_ingest"
Then use it inside the Hive script as WHERE file_date = '${hiveconf:date_ingest}'
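To see why the two approaches differ, it may help to compare the filter text Hive ends up with in each case. The sketch below is an illustration only: it mimics the textual variable substitution with plain shell variables and never calls Hive. The names slow_value, fast_value, and render_where are hypothetical, and GNU date is assumed, as in the example above.

```shell
#!/usr/bin/env bash
# Illustration only: mimics textual ${hiveconf:...} substitution with shell
# variables. It does not invoke Hive. All names here are hypothetical.

# Slow variant: the variable holds the expression text, not a computed date.
slow_value="date_add(to_date(from_unixtime(unix_timestamp())), -1)"

# Fast variant: the date is computed once, before the query text is built.
# Assumes GNU date (the -d option), as in the wrapper script above.
fast_value="'$(date -d '-1 day' +%Y-%m-%d)'"

render_where() {
  # Paste the value into the filter exactly as given, for both variants.
  printf 'WHERE file_date = %s\n' "$1"
}

render_where "$slow_value"   # filter still contains a function call
render_where "$fast_value"   # filter contains a plain quoted date literal
```

The first line printed still contains the function call, which is the form that may defeat partition pruning. The second contains a quoted date literal, the same shape as the hardcoded fast version.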