Hive - Is there a way to further optimize a HiveQL query?
Question
I wrote a query to find the 10 busiest airports in the USA from March to April. It produces the desired output, but I would like to optimize it further.
Are there any HiveQL-specific optimizations that can be applied to this query? Is GROUPING SETS applicable here? I'm new to Hive, and for now this is the shortest query I have come up with.
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
The table columns are as follows:
airports
|iata|airport|city|state|country|
Flights_stats
|originAirport|destAirport|FlightsNum|Cancelled|Month|
Recommended answer
Filter by airport (the inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries, each with its own join, should run in parallel and finish faster than a single join against the larger dataset produced after the UNION ALL.
SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
UNION ALL
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
) f
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;
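One way to check whether the join in each branch is actually executed as a map join is to inspect the plan with EXPLAIN. The following is a minimal sketch on the first branch only, using the same table and column names as the query above; the exact plan output depends on the Hive version and execution engine, so look for a map-join operator in the output rather than an exact string.

```sql
-- Sketch: show the execution plan for one branch of the UNION ALL.
-- If map-join conversion applies, the plan should contain a map-join
-- operator instead of a shuffle (common) join.
EXPLAIN
SELECT a.airport, COUNT(*) AS cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport;
```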
Tune map joins and enable parallel execution:
set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
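As a rough illustration of what such settings can look like, here is a hedged sketch. The property names are standard Hive and Tez settings, but the numeric values are illustrative assumptions rather than tuned recommendations, and vectorization generally works best with columnar formats such as ORC.

```sql
-- Sketch only: values below are assumptions, adjust for your cluster and data size.
set hive.execution.engine=tez;                      -- run on Tez instead of MapReduce
set hive.vectorized.execution.enabled=true;         -- vectorize the map side
set hive.vectorized.execution.reduce.enabled=true;  -- vectorize the reduce side

-- Mapper parallelism on Tez: a smaller maximum split group size means more mappers.
set tez.grouping.min-size=16777216;   -- 16 MB (illustrative)
set tez.grouping.max-size=67108864;   -- 64 MB (illustrative)

-- Reducer parallelism: fewer bytes per reducer means more reducers.
set hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB (illustrative)
```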