Optimize join performance with a Hive partitioned table


Problem description

I have a Hive ORC table test_dev_db.TransactionUpdateTable with some sample data. It holds the incremental data that needs to be merged into the main table (test_dev_db.TransactionMainHistoryTable), which is partitioned on the columns Country and Tran_date.

Hive incremental load table schema (it holds 19 rows that need to be merged):

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionUpdateTable 
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string, 
City          string,
State         string,
Country       string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;

Hive main table schema (total row count: 77):

CREATE TABLE IF NOT EXISTS test_dev_db.TransactionMainHistoryTable
(
Transaction_date timestamp,
Product       string,
Price         int,
Payment_Type  string,
Name          string,
City          string,
State         string
)
PARTITIONED BY (Country string,Tran_date string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;

I am running the query below to merge the incremental data into the main table.

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) 
  else t.tran_date end as tran_date
  from
test_dev_db.TransactionMainHistoryTable t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;
Sample partition directories of the main table:

/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-03-01
/hdfs/path/database/test_dev_db.db/transactionmainhistorytable/country=Australia/tran_date=2009-05-01
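For context, the merged result of a full join like this is typically written back into the partitioned main table with a dynamic-partition insert. A minimal sketch follows; the two SET statements are session settings assumed here, not part of the original question:

```sql
-- Sketch: write the merged rows back, letting Hive route each row to its
-- partition. The partition columns (Country, Tran_date) must come last
-- in the SELECT list, in PARTITION-clause order.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE test_dev_db.TransactionMainHistoryTable
PARTITION (Country, Tran_date)
SELECT
  CASE WHEN i.transaction_date IS NOT NULL
       THEN cast(substring(current_timestamp(),0,19) AS timestamp)
       ELSE t.transaction_date END AS transaction_date,
  t.product,
  CASE WHEN i.price IS NOT NULL THEN i.price ELSE t.price END AS price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,                                   -- dynamic partition column 1
  CASE WHEN i.transaction_date IS NOT NULL
       THEN substring(current_timestamp(),0,10)
       ELSE t.tran_date END AS tran_date       -- dynamic partition column 2
FROM test_dev_db.TransactionMainHistoryTable t
FULL JOIN test_dev_db.TransactionUpdateTable i ON (t.Name = i.Name);
```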

And I am running the query below to filter out only the specific partitions that need to be merged, to avoid rewriting the partitions that have no updates.

SELECT
  case when i.transaction_date is not null then cast(substring(current_timestamp(),0,19) as timestamp)  
  else t.transaction_date   end as transaction_date,
  t.product,
  case when i.price is not null then i.price else t.price end as price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  case when i.transaction_date is not null then substring(current_timestamp(),0,10) else t.tran_date end as tran_date
  from
(SELECT 
  *
  FROM 
test_dev_db.TransactionMainHistoryTable
where Tran_date in
(select distinct  from_unixtime(to_unix_timestamp (Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd') from test_dev_db.TransactionUpdateTable
))t
full join test_dev_db.TransactionUpdateTable i on (t.Name=i.Name)
;

Only Transaction_date, Price, and the partition column tran_date need to be updated in both cases. Both queries run fine, although the latter takes longer to execute.

The execution plan for the partition-filtered query is:

 Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: transactionmainhistorytable
            filterExpr: tran_date is not null (type: boolean)
            Statistics: Num rows: 77 Data size: 39151 Basic stats: COMPLETE Column stats: COMPLETE
            Map Join Operator
              condition map:
                   Left Semi Join 0 to 1
              keys:
                0 tran_date (type: string)
                1 _col0 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8

Am I doing something wrong in the second query? Do I need to use both partition columns for better pruning? Any help or advice is greatly appreciated.

Answer

Maybe this is not a complete answer, but I hope these thoughts will be useful.

where tran_date IN (select ... )

is effectively the same as

LEFT SEMI JOIN (SELECT ...)

This is reflected in the plan:

Map Join Operator
  condition map:
       Left Semi Join 0 to 1
  keys:
    0 tran_date (type: string)
    1 _col0 (type: string)

And it is executed as a map-join. First the subquery dataset is selected, then it is placed in the distributed cache and loaded into memory for use in the map-join. All of these steps (select, load into memory, map-join) are slower than simply reading and overwriting the whole table, because the table is so small and over-partitioned: the statistics say Num rows: 77 Data size: 39151, which is too small to be partitioned by two columns, and even too small to be partitioned at all. Try a bigger table, and use EXPLAIN EXTENDED to check what is really being scanned.
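To check what is really scanned, EXPLAIN EXTENDED can be run on the filtered query; a sketch (the output sections named in the comment are what Hive typically prints, based on common usage):

```sql
-- EXPLAIN EXTENDED lists the concrete partition directories under
-- "Path -> Alias" / "Path -> Partition", so you can confirm whether
-- the subquery filter actually pruned any partitions.
EXPLAIN EXTENDED
SELECT *
FROM test_dev_db.TransactionMainHistoryTable
WHERE Tran_date IN
  (SELECT DISTINCT substr(Transaction_date, 0, 10)
   FROM test_dev_db.TransactionUpdateTable);
```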

Also, replace

from_unixtime(to_unix_timestamp(Transaction_date,'yyyy-MM-dd HH:mm'),'yyyy-MM-dd')

with substr(Transaction_date,0,10) or date(Transaction_date), and replace substring(current_timestamp,0,10) with current_date, just to simplify the code a bit.
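With those substitutions applied, the projection in the merge query could read roughly as follows; this is a sketch, and note that current_timestamp() keeps fractional seconds that the original substring(...,0,19) call truncated:

```sql
SELECT
  CASE WHEN i.transaction_date IS NOT NULL THEN current_timestamp()
       ELSE t.transaction_date END AS transaction_date,
  t.product,
  CASE WHEN i.price IS NOT NULL THEN i.price ELSE t.price END AS price,
  t.payment_type,
  t.name,
  t.city,
  t.state,
  t.country,
  -- current_date replaces substring(current_timestamp, 0, 10);
  -- cast to string to match the tran_date partition column type
  CASE WHEN i.transaction_date IS NOT NULL THEN cast(current_date AS string)
       ELSE t.tran_date END AS tran_date
FROM test_dev_db.TransactionMainHistoryTable t
FULL JOIN test_dev_db.TransactionUpdateTable i ON (t.Name = i.Name);
```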

If you want a partition filter displayed in the plan, try substituting the subquery with a static list of partitions: select the list in a separate session and use the shell to pass it into the where clause; see this answer: https://stackoverflow.com/a/56963448/2700344
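That shell-driven approach could be sketched like this; the variable name and hive CLI invocation are assumptions for illustration, and the script obviously needs a live Hive cluster:

```shell
# Step 1: collect the distinct increment dates as a quoted, comma-separated
# list ('2009-03-01','2009-05-01',...) in a shell variable.
partition_list=$(hive -S -e "
  SELECT concat_ws(',', collect_set(concat('\'', substr(Transaction_date,0,10), '\'')))
  FROM test_dev_db.TransactionUpdateTable;")

# Step 2: inject the list as a static IN-filter, which the planner can use
# for partition pruning (visible in the plan, unlike the subquery form).
hive -e "
  SELECT *
  FROM test_dev_db.TransactionMainHistoryTable
  WHERE Tran_date IN (${partition_list});"
```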
