Partition Athena query by S3 created date

Problem Description

I have an S3 bucket with ~70 million JSONs (~15 TB) and an Athena table to query them by timestamp and some other keys defined in the JSON.

It is guaranteed that the timestamp in the JSON is more or less equal to the S3-createdDate of the JSON (or at least equal enough for the purpose of my query).

Can I somehow improve query performance (and cost) by adding the createdDate as something like a "partition", which I understand seems only to be possible for prefixes/folders?

Edit: I currently simulate that by using the S3 Inventory CSV to pre-filter by createdDate, then downloading all the JSONs and doing the rest of the filtering, but I'd like to do that completely inside Athena, if possible.

Recommended Answer

There is no way to make Athena use things like S3 object metadata for query planning. The only way to make Athena skip reading objects is to organize the objects in a way that makes it possible to set up a partitioned table, and then query with filters on the partition keys.

It sounds like you have an idea of how partitioning in Athena works, and I assume there is a reason you are not using it. However, for the benefit of others with similar problems coming across this question, I'll start by explaining what you can do if you can change the way the objects are organized. I'll give an alternative suggestion at the end; you may want to jump straight to that.

I would suggest you organize the JSON objects using prefixes that contain some part of the timestamps of the objects. Exactly how much depends on the way you query the data. You don't want it too granular, but not too coarse either. Making it too granular will make Athena spend more time listing files on S3; making it too coarse will make it read too many files. If the most common time period of queries is a month, that is a good granularity; if the most common period is a couple of days, then day is probably better.
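
To put rough numbers on that trade-off: if your ~70 million objects span something like five years (an assumption, adjust for your data), day partitions average roughly 70,000,000 / 1,826 ≈ 38,000 objects each, so even a single-day query still reads tens of thousands of files. Doing this arithmetic for your actual query patterns is a good way to pick the granularity.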

For example, if day is the best granularity for your dataset you could organize the objects using keys like this:

s3://some-bucket/data/2019-03-07/object0.json
s3://some-bucket/data/2019-03-07/object1.json
s3://some-bucket/data/2019-03-08/object0.json
s3://some-bucket/data/2019-03-08/object1.json
s3://some-bucket/data/2019-03-08/object2.json

You can also use a Hive-style partitioning scheme, which is what other tools like Glue, Spark, and Hive expect, so unless you have reasons not to, it can save you grief in the future:

s3://some-bucket/data/created_date=2019-03-07/object0.json
s3://some-bucket/data/created_date=2019-03-07/object1.json
s3://some-bucket/data/created_date=2019-03-08/object0.json

I chose the name created_date here; I don't know what would be a good name for your data. You can use just date, but remember to always quote it (and quote it differently in DDL and DML…) since it's a reserved word.
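
If you do go with date, the quoting rules look like this (sticking with the my_data example; backticks in DDL, double quotes in DML):

-- In DDL, quote the reserved word with backticks:
PARTITIONED BY (`date` date)

-- In DML, quote it with double quotes:
SELECT COUNT(*) FROM my_data WHERE "date" = DATE '2019-03-07'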

Then you create a partitioned table:

CREATE EXTERNAL TABLE my_data (
  column0 string,
  column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')

Some guides will then tell you to run MSCK REPAIR TABLE to load the partitions for the table. If you use Hive-style partitioning (i.e. …/created_date=2019-03-08/…) you can do this, but it will take a long time and I wouldn't recommend it. You can do a much better job of it by manually adding the partitions, which you do like this:

ALTER TABLE my_data ADD
  PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
  PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
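
If you want a quick sanity check that the partitions were registered, you can list them (this assumes the my_data table from above):

SHOW PARTITIONS my_data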

Finally, when you query the table, make sure to include a filter on the created_date column, so that Athena has the information it needs to read only the objects that are relevant to the query:

SELECT COUNT(*)
FROM my_data
WHERE created_date >= DATE '2019-03-07'

You can verify that the query will be cheaper by observing the difference in data scanned when you change from, for example, created_date >= DATE '2019-03-07' to created_date = DATE '2019-03-07'.
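
Also, since the timestamp in your JSON is only more or less equal to the S3 created date, a realistic query would filter the partition key with a margin of a day on each side and then filter precisely on the timestamp property itself. Here event_ts is a hypothetical name standing in for whatever your timestamp key is called:

SELECT COUNT(*)
FROM my_data
WHERE created_date BETWEEN DATE '2019-03-06' AND DATE '2019-03-08'
  AND event_ts BETWEEN TIMESTAMP '2019-03-07 00:00:00' AND TIMESTAMP '2019-03-07 23:59:59'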

If you are not able to change the way the objects are organized on S3, there is a poorly documented feature that makes it possible to create a partitioned table even when you can't change the data objects. What you do is you create the same prefixes as I suggest above, but instead of moving the JSON objects into this structure you put a file called symlink.txt in each partition's prefix:

s3://some-bucket/data/created_date=2019-03-07/symlink.txt
s3://some-bucket/data/created_date=2019-03-08/symlink.txt

In each symlink.txt you put the full S3 URI of the files that you want to include in that partition. For example, in the first file you could put:

s3://data-bucket/data/object0.json
s3://data-bucket/data/object1.json

And in the second file:

s3://data-bucket/data/object2.json
s3://data-bucket/data/object3.json
s3://data-bucket/data/object4.json

Then you create a table that looks very similar to the table above, but with one small difference:

CREATE EXTERNAL TABLE my_data (
  column0 string,
  column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')

Notice the value of the INPUTFORMAT property.

You add partitions just like you do for any partitioned table:

ALTER TABLE my_data ADD
  PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
  PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
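
If you want to convince yourself that the symlink indirection works, Athena's "$path" pseudo-column shows which files a query actually read; my expectation is that it lists the JSON objects named in the symlink.txt files rather than the symlink files themselves, but verify this against your own setup:

SELECT DISTINCT "$path"
FROM my_data
WHERE created_date = DATE '2019-03-07'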

The only Athena-related documentation of this feature that I have come across is the S3 Inventory docs for integrating with Athena.
