Hive query performance is slow when using Hive date functions instead of hardcoded date strings?


Problem description

I have a transaction table, table_A, that gets updated every day. Every day I insert new data into table_A from the external table table_B, filtering on the file_date field to select the rows to load. However, there is a huge performance difference between using a hardcoded date and using the Hive date functions:

-- Fast version (~20 minutes)
SET date_ingest = '2016-12-07';
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;

INSERT INTO TABLE table_A PARTITION (FILE_DATE)
SELECT
    id,
    eventtime,
    CONCAT_WS('-', SUBSTR(eventtime, 0, 4), SUBSTRING(eventtime, 5, 2), SUBSTRING(eventtime, 7, 2))
FROM table_B
WHERE file_date = ${hiveconf:date_ingest}
;

compared to:

-- Slow version (~9 hours)
SET date_ingest = date_add(to_date(from_unixtime(unix_timestamp())), -1);
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = TRUE;

INSERT INTO TABLE table_A PARTITION (FILE_DATE)
SELECT
    id,
    eventtime,
    CONCAT_WS('-', SUBSTR(eventtime, 0, 4), SUBSTRING(eventtime, 5, 2), SUBSTRING(eventtime, 7, 2))
FROM table_B
WHERE file_date = ${hiveconf:date_ingest}
;

Has anyone experienced similar issues? Please assume that I don't have access to the Unix hive command (i.e. I can't use --hiveconf options), since we're using a third-party UI.

Solution

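Partition pruning sometimes does not work when the filter clause contains a function call. SET stores its right-hand side as plain text, so after variable substitution the slow query's WHERE clause contains the whole date_add(...) expression rather than a date literal. unix_timestamp() with no arguments is non-deterministic, which is a likely reason the expression is not folded to a constant before the partitions are selected, so every partition of table_B gets scanned. If you compute the value in a wrapper shell script and pass it to Hive as a -hiveconf variable, the query works as expected. Example: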

# Inside the wrapper shell script (date -d is GNU date syntax)
date_ingest=$(date -d '-1 day' +%Y-%m-%d)
hive -f your_script.hql -hiveconf date_ingest="$date_ingest"

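Then reference it in the Hive script as WHERE file_date = '${hiveconf:date_ingest}'. The single quotes are needed here because the shell passes the bare value (for example 2016-12-07), whereas the SET statement in the question already included them.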
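
To make the difference between the two versions concrete, here is a minimal bash sketch of what the WHERE clause looks like once the variable has been substituted. The substitute_hiveconf function is a hypothetical helper written for this illustration only; Hive performs the real substitution itself, and the sketch assumes that substitution is purely textual.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hive): mimics a textual
# ${hiveconf:name} substitution so the resulting filter can be inspected.
substitute_hiveconf() {
  local template=$1 name=$2 value=$3
  local placeholder="\${hiveconf:${name}}"
  # Replace every literal occurrence of the placeholder with the value.
  printf '%s\n' "${template//"$placeholder"/$value}"
}

where_clause='WHERE file_date = ${hiveconf:date_ingest}'

# Fast version: the variable holds a quoted date literal.
substitute_hiveconf "$where_clause" date_ingest "'2016-12-07'"

# Slow version: the variable holds the unevaluated expression text.
substitute_hiveconf "$where_clause" date_ingest \
  "date_add(to_date(from_unixtime(unix_timestamp())), -1)"
```

The first call prints a filter that compares the partition column with a constant, while the second prints a filter that still contains the function calls. To check whether pruning actually happens in a given environment, prefixing the SELECT with EXPLAIN DEPENDENCY in Hive should list the input partitions the query would read.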
