使用聚合函数时,减少雅典娜扫描的数据量 [英] reduce the amount of data scanned by Athena when using aggregate functions

查看:92
本文介绍了使用聚合函数时,减少雅典娜扫描的数据量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面的查询扫描100 mb的数据。

The below query scans 100 mb of data.

select * from table where column1 = 'val' and partition_id = '20190309';

但是下面的查询扫描了15 GB的数据(有超过90个分区)

However the below query scans 15 GB of data (there are over 90 partitions)

select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);

如何优化第二个查询以扫描与第一个查询相同的数据量?

How can I optimize the second query to scan the same amount of data as the first?

推荐答案

这里有两个问题。 上方的标量子查询的效率从表中选择max(partition_id),而@PiotrFindeisen指出了动态过滤。

There are two problems here. The efficiency of the the scalar subquery above select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.

第一个问题是,对Hive表的分区键的查询要比看起来复杂得多。大多数人会认为,如果您想要分区键的最大值,则可以简单地对分区键执行查询,但这是行不通的,因为Hive允许分区为空(并且还允许非空文件不包含任何行)。具体来说,在从表中选择max(partition_id)上方的标量子查询要求Presto查找包含至少一行的最大分区。理想的解决方案是在Hive中具有完善的统计信息,但是引擎还需要具有用于Hive的自定义逻辑,以打开分区文件,直到找到一个非空的分区为止。

The the first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above select max(partition_id) from table requires Presto to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need to have custom logic for hive that open files of the partitions until it found a non empty one.

如果您确定仓库中不包含空分区(或者您可以接受其中的含义),则可以将标量子查询替换为隐藏的 $中的一个分区

If you are are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar sub query with one over the hidden $partitions table"

select * 
from table 
where column1 = 'val' and 
    partition_id = (select max(partition_id) from "table$partitions");

第二个问题是@PiotrFindeisen指出的问题,它与查询被计划执行的方式有关,大多数人会看上面的查询,发现引擎显然应该找出<$ c的值$ c>在计划过程中从 table $ partitions中选择max(partition_id),将其内联到计划中,然后继续进行优化。不幸的是,这通常是一个相当复杂的决定,因此引擎将其简单地建模为广播联接,其中执行的一部分会找出该值,并将该值广播给其他工人。问题在于执行的其余部分无法将此新信息添加到现有处理中,因此它仅扫描所有数据,然后滤除您要跳过的值。目前有一个项目正在添加此 动态过滤 ,但是还没有完成。

The second problem is the one @PiotrFindeisen pointed out, and has to do with the way that queries are planned an executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value, and broadcasts the value to the rest of the workers. The problem is the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.

这意味着您今天可以做的最好的事情是运行两个单独的查询:一个查询获取最大partition_id,第二个查询带有内联值。

This means the best you can do today, is to run two separate queries: one to get the max partition_id and a second one with the inlined value.

BTW,隐藏的 $ partitions表已添加到Presto 0.199 ,我们修复了 0.201中的一些小错误。 。我不确定Athena所基于的版本,但我认为它已经过时了(我撰写此答案时的当前版本为 309

BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is is pretty far out of date (the current release at the time I'm writing this answer is 309.

这篇关于使用聚合函数时,减少雅典娜扫描的数据量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆