Alternatives for Athena to query the data on S3


Problem Description



I have around 300 GB of data on S3. Let's say the data looks like this:

## S3://Bucket/Country/Month/Day/1.csv 

S3://Countries/Germany/06/01/1.csv 
S3://Countries/Germany/06/01/2.csv 
S3://Countries/Germany/06/01/3.csv 

S3://Countries/Germany/06/02/1.csv 
S3://Countries/Germany/06/02/2.csv 

We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use: once the small countries are finished, their resources are wasted, while the big countries keep running for a long time. Therefore, we decided to use AWS Batch (Docker containers) with Athena. One job works on one day of data per country.

Now there are roughly 1000 jobs that start together, and when they query Athena to read the data, the containers fail because they reach Athena's query limits.

Therefore, I would like to know what other possible ways there are to tackle this problem. Should I use a Redshift cluster, load all the data there, and have all the containers query the Redshift cluster, since it doesn't have query limitations? But it is expensive, and takes a lot of time to ramp up.

The other option would be to read the data on EMR and use Hive or Presto on top of it to query the data, but again it will hit the query limitation.

It would be great if someone could suggest better options to tackle this problem.

Solution

One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS Batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the jobs).
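To make the retry side concrete, here is a minimal boto3 sketch of a launching wrapper; the function name, backoff schedule, and output location are illustrative, not part of the original setup. Athena rejects queries over the concurrency limit with a TooManyRequestsException, which the wrapper backs off on:

    import time
    import boto3
    from botocore.exceptions import ClientError

    athena = boto3.client("athena")

    def start_query_with_retry(sql, output_location, max_attempts=8):
        """Start an Athena query, backing off when the concurrency limit is hit."""
        for attempt in range(max_attempts):
            try:
                resp = athena.start_query_execution(
                    QueryString=sql,
                    ResultConfiguration={"OutputLocation": output_location},
                )
                return resp["QueryExecutionId"]
            except ClientError as err:
                # Athena signals the concurrency limit as TooManyRequestsException
                if err.response["Error"]["Code"] == "TooManyRequestsException":
                    time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
                else:
                    raise
        raise RuntimeError("query did not start within the retry budget")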

Another option would be to not run them as separate queries, but to bake everything together into fewer queries, or even a single one – either by grouping on country and date, or by generating all the queries and gluing them together with UNION ALL. Whether this is possible is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
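As a sketch of that second approach, assuming a table named my_table with country and day columns (the table name, column names, and the count(*) aggregation are all placeholders for your real query), the per-job queries could be generated and glued together like this:

    from itertools import product

    countries = ["Germany", "France"]        # illustrative values
    days = ["06/01", "06/02"]

    subqueries = [
        f"SELECT '{c}' AS country, '{d}' AS day, count(*) AS n "
        f"FROM my_table WHERE country = '{c}' AND day = '{d}'"
        for c, d in product(countries, days)
    ]
    sql = "\nUNION ALL\n".join(subqueries) + "\nORDER BY country, day"

Sorting by country and day is what makes it cheap to split the single result set back into per-job pieces afterwards.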

Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. Athena is a much better choice, you just have to handle the limits better.

With my limited understanding of your use case, I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks for completed queries and starts new ones to keep the number of running queries at the max. When all queries have run, a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
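A rough sketch of the two Lambda handlers such a state machine might call; the event shape, the MAX_CONCURRENT value, and the output field are all assumptions about how you'd wire it up:

    import boto3

    athena = boto3.client("athena")

    MAX_CONCURRENT = 20  # stay under the default 25-query limit

    def start_batch(event, context):
        """Kick off the first wave of queries; state carries pending/running sets."""
        pending = list(event["queries"])     # SQL strings still to run (assumed input)
        running = []
        while pending and len(running) < MAX_CONCURRENT:
            resp = athena.start_query_execution(
                QueryString=pending.pop(),
                ResultConfiguration={"OutputLocation": event["output"]},
            )
            running.append(resp["QueryExecutionId"])
        return {"pending": pending, "running": running, "output": event["output"]}

    def poll(event, context):
        """Called on a timer by the state machine: drop finished queries, top up."""
        still_running = []
        for qid in event["running"]:
            state = athena.get_query_execution(QueryExecutionId=qid)[
                "QueryExecution"]["Status"]["State"]
            if state in ("QUEUED", "RUNNING"):
                still_running.append(qid)
        pending = event["pending"]
        while pending and len(still_running) < MAX_CONCURRENT:
            resp = athena.start_query_execution(
                QueryString=pending.pop(),
                ResultConfiguration={"OutputLocation": event["output"]},
            )
            still_running.append(resp["QueryExecutionId"])
        return {"pending": pending, "running": still_running,
                "output": event["output"], "done": not (pending or still_running)}

The state machine itself would then just be a Wait state plus a Choice on the returned done flag, looping back to the poller until everything has finished.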

The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API, you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
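As a back-of-the-envelope check, with all numbers assumed (1,000 queries a day, about a dozen 100ms invocations per query, a 128 MB function):

    queries_per_day = 1000
    invocations_per_query = 1 + 10 + 1       # start + ~10 status polls + finish
    seconds = queries_per_day * invocations_per_query * 0.1
    gb_seconds = seconds * 128 / 1024        # 128 MB function
    print(gb_seconds)                        # 150 GB-s/day, ~4,500 GB-s/month

That is a tiny fraction of the Lambda free tier's 400,000 GB-seconds per month.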
