Querying S3 using Athena


Problem Description

I have a setup where Kinesis Firehose receives data, an AWS Lambda performs data transformation, and the incoming data is placed in an S3 bucket. The S3 structure is organized as year/month/day/hour/messages.json, so all of the actual JSON files I want to query are at the 'hour' level, and the year, month, and day directories only contain subdirectories.

My problem is that I need to run a query to get all of the data for a given day. Is there an easy way to query at the 'day' directory level and return all of the files in its subdirectories, without having to run queries for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02 ... 2020/06/15/23?

I can successfully query the hour-level directories, since I can create a table and define the column names and types represented in the .json files, but I'm not sure how to create a table in Athena (if that's even possible) to represent a day directory that contains subdirectories rather than actual files.

Solution


To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.

How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose from a number of different ways to encode this partitioning in a table; which one is best depends on the kinds of queries you are going to run. You say you want to query by day, which makes sense and will work great in this case.
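To make the day-level choice concrete, here is a minimal sketch (the helper name is hypothetical, not part of any Athena API) of how an hour-level key from the layout described in the question collapses to a single day-level partition value:

```python
def day_partition(key: str) -> str:
    """Return the 'year/month/day' prefix of an hour-level S3 key
    laid out as year/month/day/hour/messages.json."""
    year, month, day, _hour, _filename = key.split("/")
    return f"{year}/{month}/{day}"

# All 24 hourly keys of one day map to the same partition value:
print(day_partition("2020/06/15/03/messages.json"))  # → 2020/06/15
```

Every hourly file under a given day shares this prefix, which is why a single string-typed `date` partition key is enough.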

There are two ways to set this up: the traditional way, and the new way. The new way uses a feature that was released just a couple of days ago, so if you try to find more examples of it you may not find many; that's why I'm going to show you the traditional way too.

Using Partition Projection

Use the following SQL to create your table (you have to fill in the columns yourself; since you say you've already created a table successfully, just use the columns from that table, and also fix the S3 locations):

CREATE EXTERNAL TABLE cszlos_firehose_data (
  -- fill in your columns here
)
PARTITIONED BY (
  `date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.date.type" = "date",
  "projection.date.range" = "2020/06/01,NOW",
  "projection.date.format" = "yyyy/MM/dd",
  "projection.date.interval" = "1",
  "projection.date.interval.unit" = "DAYS",
  "storage.location.template" = "s3://cszlos-data/is/here/${date}"
)

This creates a table partitioned by date. Note that date is a reserved word, so you need to quote it in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …; if you want to avoid the quoting, use another name (dt seems popular). Also note that the name is escaped with backticks in DDL and with double quotes in DML statements. When you query this table and specify a criterion for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.

The table uses Partition Projection, a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data. Here I'm telling Athena to assume that data exists on S3 from 2020-06-01 up until the time the query runs (adjust the start date as necessary), which means that if you specify a date before that time, or after "now", Athena will know that there is no such data and won't even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05', Athena will generate each date (controlled by the projection.date.interval property) and read the data in s3://cszlos-data/is/here/2020/06/06, s3://cszlos-data/is/here/2020/06/07, etc.
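Conceptually, the enumeration Athena performs for a range query can be sketched like this (the function is hypothetical; Athena does this internally, this just illustrates how the range, interval, and storage.location.template interact):

```python
from datetime import date, timedelta

def projected_locations(start: date, end: date, template: str) -> list:
    """Enumerate every date in [start, end] at a one-day interval
    (projection.date.interval = 1 DAYS) and substitute it into the
    storage.location.template, as partition projection does."""
    locations = []
    d = start
    while d <= end:
        # yyyy/MM/dd matches the projection.date.format in the DDL above
        locations.append(template.replace("${date}", d.strftime("%Y/%m/%d")))
        d += timedelta(days=1)
    return locations

locs = projected_locations(date(2020, 6, 6), date(2020, 6, 8),
                           "s3://cszlos-data/is/here/${date}")
# One S3 location per projected day; Athena reads only these prefixes.
```

Because the locations are computed rather than looked up, no partition metadata ever needs to be stored or maintained.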

You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.

The traditional way

The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):

CREATE EXTERNAL TABLE cszlos_firehose_data (
  -- fill in your columns here
)
PARTITIONED BY (
  `date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'

This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).

You can add partitions in many different ways, but the most straightforward for interactive use is ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:

ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` =  '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` =  '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` =  '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` =  '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
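Writing one PARTITION clause per day quickly gets tedious, so here is a minimal sketch (the generator function is hypothetical; the table name and bucket prefix are taken from the example above) that builds the same statement for a range of days:

```python
from datetime import date, timedelta

def add_partitions_sql(table: str, bucket_prefix: str,
                       start: date, end: date) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement covering
    every day in [start, end]."""
    clauses = []
    d = start
    while d <= end:
        # Partition value uses dashes (as in the example above),
        # the S3 location uses the slash-separated Firehose layout.
        clauses.append(
            f"PARTITION (`date` = '{d.isoformat()}') "
            f"LOCATION '{bucket_prefix}{d.strftime('%Y/%m/%d')}'"
        )
        d += timedelta(days=1)
    return f"ALTER TABLE {table} ADD\n" + "\n".join(clauses)

sql = add_partitions_sql("cszlos_firehose_data", "s3://cszlos-data/is/here/",
                         date(2020, 6, 6), date(2020, 6, 9))
```

You could run the generated statement from the Athena console, or wrap this in a scheduled job that adds each new day's partition as it arrives.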

If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.
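The distinction between Hive style and plain paths can be sketched with a simple check (hypothetical helper, just to illustrate why MSCK REPAIR TABLE can't see the Firehose layout):

```python
import re

# Hive-style keys embed the partition name in each path segment,
# e.g. year=2020/month=06/day=07/file.json. MSCK REPAIR TABLE only
# discovers partitions in that layout; plain date paths like the
# Firehose default (2020/06/07/...) are invisible to it.
HIVE_STYLE = re.compile(r"^([a-z_]+=[^/]+/)+[^/]+$")

def is_hive_style(key: str) -> bool:
    return bool(HIVE_STYLE.match(key))

print(is_hive_style("year=2020/month=06/day=07/file.json"))  # → True
print(is_hive_style("2020/06/15/03/messages.json"))          # → False
```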
