Hive - Is there a way to further optimize a HiveQL query?


Question

I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output, but I want to try to optimize it further.

Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive, and for now this is the shortest query that I've come up with.

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
  SELECT Origin AS Airport, FlightsNum
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
  UNION ALL
  SELECT Dest AS Airport, FlightsNum
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;

The table columns are as follows:

Airports

|iata|airport|city|state|country|

Flights_stats

|originAirport|destAirport|FlightsNum|Cancelled|Month|

Recommended answer

Filter by airport (the inner join) and aggregate before the UNION ALL, so that less data is passed to the final aggregation reducer. The two UNION ALL subqueries, each with its own join, should run in parallel and finish faster than a single join against the larger dataset produced after the UNION ALL.

SELECT t.airport, SUM(t.cnt) AS Total_Flights
FROM (
      -- departures per US airport, pre-aggregated
      SELECT a.airport, COUNT(*) AS cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
       WHERE f.Cancelled = 0 AND f.Month IN (3,4)
       GROUP BY a.airport
       UNION ALL
      -- arrivals per US airport, pre-aggregated
      SELECT a.airport, COUNT(*) AS cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Dest = a.iata AND a.country = 'USA'
       WHERE f.Cancelled = 0 AND f.Month IN (3,4)
       GROUP BY a.airport
     ) t
-- the final aggregation only has to add up the partial counts
GROUP BY t.airport
ORDER BY Total_Flights DESC
LIMIT 10
;

Tune map joins and enable parallel execution:

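-- run independent stages (such as the two UNION ALL branches) in parallel
set hive.exec.parallel=true;
-- enable automatic conversion to map join
set hive.auto.convert.join=true;
-- size threshold in bytes for the small table that has to fit in memory
set hive.mapjoin.smalltable.filesize=25000000;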
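
Whether the join is actually converted to a map join depends on the size of the airports table relative to the threshold above. One way to check, sketched here under the assumption that the tables and columns are the ones used in the query above, is to prefix one of the subqueries with EXPLAIN and inspect the plan. When the conversion happens, the plan should show a map-join operator rather than a reduce-side join; the exact plan text varies by Hive version and execution engine.

```sql
-- Sketch only: reuses the table and column names from the answer's query.
-- EXPLAIN prints the execution plan without running the query.
EXPLAIN
SELECT a.airport, COUNT(*) AS cnt
FROM flights_stats f
     INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
WHERE f.Cancelled = 0 AND f.Month IN (3,4)
GROUP BY a.airport;
```

If the plan still shows a reduce-side join, the airports table may be larger than hive.mapjoin.smalltable.filesize.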

Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
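
As a rough sketch of what those settings can look like: the property names below are standard Hive and Tez settings, but the numeric values are placeholders chosen for illustration, not recommendations from the linked answer. They assume that Tez is installed on the cluster, and vectorization in older Hive versions mainly benefits ORC-backed tables.

```sql
-- Illustrative values only; tune them for the actual cluster and data size.

-- Run on Tez instead of MapReduce (assumes Tez is available).
set hive.execution.engine=tez;

-- Process rows in batches instead of one at a time.
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- Reducer parallelism: fewer bytes per reducer means more reducers.
set hive.exec.reducers.bytes.per.reducer=67108864;

-- Mapper parallelism on Tez: smaller split groups mean more mappers.
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=67108864;
```

Changing one setting at a time and comparing run times makes it easier to tell which of them actually helps for this query.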
