Athena: $path vs. partition

Problem Description

I'm storing daily reports per client to query with Athena.
At first I thought I'd use a client=c_1/month=12/day=01/ or client=c2/date=2020-12-01/ folder structure, and run MSCK REPAIR TABLE daily to make the new day's partition available for querying.
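
For reference, a minimal sketch of what the second layout could look like as an Athena table; the bucket, table, and data column names here are placeholders, not details from the question:

    -- Assumed layout: s3://my-bucket/reports/client=c_1/date=2020-12-01/report.csv
    -- `date` is a reserved word in Athena DDL, so it is back-quoted.
    CREATE EXTERNAL TABLE reports (
      metric string,
      value  double
    )
    PARTITIONED BY (client string, `date` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/reports/';

    -- Run after each daily upload so the new day's partition becomes queryable.
    MSCK REPAIR TABLE reports;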

Then I realized there's the $path special column, so if I store files as 2020-12-01.csv I could run a query with WHERE "$path" LIKE '%12-01%', thus saving a partition and the need to detect/add it daily.
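
A sketch of that $path approach, assuming a hypothetical non-partitioned table reports_flat whose daily files are named like 2020-12-01.csv; matching on the full date avoids accidental hits such as 2019-12-01:

    -- "$path" is the special column holding the full S3 URI of the file a row came from;
    -- it must be double-quoted in the query.
    SELECT metric, value
    FROM reports_flat
    WHERE "$path" LIKE '%2020-12-01%';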

I can see this having an impact on performance if there was a lot of daily data, but in my case the day partition will include one file at most, so a partition is mostly there to give me a field to query by, not to reduce the query dataset.

Are there any other downsides?

Recommended Answer

Currently Athena does not apply any optimisations for $path, which means that there is no meaningful difference between WHERE "$path" LIKE '%12-01%' and WHERE "date" = '2020-12-01' (assuming you have a column date which contains the same date as the file name). Your data probably already has a date or datetime column, and your queries will be more readable using it rather than $path.
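
To illustrate that point, both of the following queries (placeholder table and column names) end up scanning the same files today, because Athena does not use "$path" to prune which objects it reads:

    SELECT * FROM reports WHERE "$path" LIKE '%2020-12-01%';
    SELECT * FROM reports WHERE "date" = '2020-12-01';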

You are definitely on the right track in questioning whether or not you need the date part of your current partitioning scheme. There are lots of different considerations when partitioning data sets, and it's not always easy to say what is right without analysing the situation in detail.

I would recommend having some kind of time-based partition key. Otherwise you will have no way to limit the amount of data read by queries, and they will become slower and more expensive as time goes on. Partitioning on date is probably too fine-grained for your use case, but perhaps year or month would work.
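
One possible month-based scheme, as a sketch only; the table, column, and S3 path names are assumptions rather than anything from the answer. The day stays in an ordinary column, so a query prunes to one month per client and then filters the exact day inside that much smaller slice:

    -- Assumed layout: s3://my-bucket/reports/client=c_1/month=2020-12/2020-12-01.csv
    CREATE EXTERNAL TABLE reports_by_month (
      report_date date,
      metric      string,
      value       double
    )
    PARTITIONED BY (client string, month string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/reports/';

    -- Only the 2020-12 partition for this client is scanned.
    SELECT metric, value
    FROM reports_by_month
    WHERE client = 'c_1'
      AND month = '2020-12'
      AND report_date = DATE '2020-12-01';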

However, if there will only be data for a client for a short time (less than one thousand files in total, the size of one S3 listing page), or if queries always read all the data for a client, you don't need a time-based partition key.

To do a deeper analysis of how to partition your data I would need to know more about the types of queries you will be running, how the data is updated, how much data the files are expected to contain, and how much difference there will be from client to client.
