AWS Athena - Query data from different years in partitions


Question

We have large datasets partitioned in S3 like s3://bucket/year=YYYY/month=MM/day=DD/file.csv.

What would be the best way to query data from different years in Athena while taking advantage of the partitioning?

Here's what I tried for data from 2018-03-07 to 2020-03-06:

Query 1 - ran for 2 min 45 s before I cancelled it

SELECT dt, col1, col2
FROM mytable
WHERE year BETWEEN '2018' AND '2020'
AND dt BETWEEN '2018-03-07' AND '2020-03-06'
ORDER BY dt

Query 2 - ran for about 2 min. However, I don't think it would be efficient if the period were, for example, 2005 to 2020

SELECT dt, col1, col2
FROM mytable
WHERE (year = '2018' AND month >= '03' AND dt >= '2018-03-07')
OR year = '2019' OR (year = '2020' AND month <= '03' AND dt <= '2020-03-06')
ORDER BY dt

Recommended answer

I would suggest repartitioning the table by dt only (yyyy-MM-dd) instead of year, month, day. This is simple and partition pruning will work, though queries that filter on year only, such as where year >= '2020', should be rewritten as dt >= '2020-01-01', and so on.
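As a rough illustration of that rewrite, here is a minimal Python sketch that turns a year-only filter into the equivalent range on dt. The function name `year_filter_to_dt_range` is hypothetical, and the sketch assumes dt is stored as a yyyy-MM-dd string, so that string comparison matches date order.

```python
from datetime import date


def year_filter_to_dt_range(first_year: int, last_year: int) -> str:
    """Build a dt predicate covering first_year..last_year inclusive.

    Hypothetical helper: assumes the table is partitioned by a
    yyyy-MM-dd string column named dt.
    """
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    start = date(first_year, 1, 1)
    # Exclusive upper bound: the first day of the year after last_year.
    end_exclusive = date(last_year + 1, 1, 1)
    return (
        f"dt >= '{start.isoformat()}' "
        f"AND dt < '{end_exclusive.isoformat()}'"
    )


print(year_filter_to_dt_range(2018, 2019))
```

The half-open range (>= start, < next year) avoids having to know the last day of the final year.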

By the way, in Hive partition pruning works fine with queries like this:

where concat(year, '-', month, '-', day) >= '2018-03-07'
      and 
      concat(year, '-', month, '-', day) <= '2020-03-06'

I can't check whether the same works in Presto, but it is worth trying. You can use the || operator instead of concat().
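For anyone who wants to try the || variant, the following Python sketch builds that predicate for an arbitrary date range. The helper name `concat_partition_predicate` is hypothetical, and it assumes year, month and day are zero-padded string partition columns (as the '03' comparisons in the question suggest). Whether Athena actually prunes partitions on such an expression is not confirmed here.

```python
from datetime import date


def concat_partition_predicate(start: date, end: date) -> str:
    """Build a WHERE fragment comparing year-month-day to a date range.

    Hypothetical helper: assumes year, month and day are zero-padded
    string partition columns, so the concatenated key sorts like a date.
    """
    if start > end:
        raise ValueError("start must not be after end")
    # Parenthesised so the comparison applies to the whole concatenation.
    key = "(year || '-' || month || '-' || day)"
    return (
        f"{key} >= '{start.isoformat()}' "
        f"AND {key} <= '{end.isoformat()}'"
    )


print(concat_partition_predicate(date(2018, 3, 7), date(2020, 3, 6)))
```

The resulting fragment can be pasted after WHERE in the original query in place of the year, month and dt conditions.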
