AWS Athena too slow for an API?

Question

The plan was to get data from AWS Data Exchange, move it to an S3 bucket, and then query it with AWS Athena for a data API. Everything works, it just feels a bit slow.

No matter the dataset or the query, I can't get below 2 seconds of Athena response time, which is a lot for an API. I checked the best practices, but it seems those are also above 2 seconds.

So my question: is 2 seconds the minimum response time for Athena?

If so, then I have to switch to Postgres.

Answer

Athena is indeed not a low-latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of API it is. If it's some kind of analytics service, perhaps users don't expect sub-second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.

To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena (sketched in code after the list):

  1. Your code starts a query by using the StartQueryExecution API call
  2. The Athena service receives the query and puts it on a queue. If you're unlucky, your query will sit in the queue for a while
  3. When there is available capacity, the Athena service takes your query from the queue and makes a query plan
  4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
  5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
  6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
  7. The results of the parallel executions are combined, and a result is serialized as CSV and written to S3
  8. Meanwhile, your code checks if the query has completed using the GetQueryExecution API call, until it gets a response saying that the execution has succeeded, failed, or been cancelled
  9. If the execution succeeded, your code uses the GetQueryResults API call to retrieve the first page of results
  10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
  11. If there are more than 1000 rows, the last steps will be repeated
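
To make these round trips concrete, here is a minimal sketch of that lifecycle using boto3; the table, database, and output bucket names are placeholders, and the 100 ms poll interval is just an illustrative choice:

```python
import time

import boto3

athena = boto3.client("athena")

# Step 1: StartQueryExecution returns immediately with an execution ID.
start = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",  # placeholder query
    QueryExecutionContext={"Database": "my_database"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
query_id = start["QueryExecutionId"]

# Step 8: poll GetQueryExecution until the query reaches a terminal state.
# Every sleep here adds directly to the latency your API caller sees.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.1)

# Steps 9-11: GetQueryResults reads the result CSV from S3 and returns it
# as JSON, paginated at up to 1000 rows per page (the first row is the header).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id, MaxResults=1000)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```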

A Presto expert could probably give more detail about steps 4-6, even though they are probably somewhat modified in Athena's version of Presto. The details aren't very important for this discussion, though.

If you run a query over a lot of data (tens of gigabytes or more), the total execution time will be dominated by step 6. If the result is also big, step 7 will be a factor.

If your data set is small, and/or involves thousands of files on S3, then steps 4-5 will instead dominate.

Here are some reasons why Athena queries can never be fast, even when they don't touch S3 (for example SELECT NOW()):

  • There will be at least three API calls before you get the response: a StartQueryExecution, a GetQueryExecution, and a GetQueryResults; their round-trip time (RTT) alone adds up to more than 100 ms.
  • You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a bound on how quickly you can discover that the query has succeeded. For example, if you call it every 100 ms you will, on average, add half of 100 ms plus the RTT to the total time, because on average you'll miss the actual completion time by that much.
  • Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
  • GetQueryResults must read the CSV from S3, parse it, and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
  • Athena is a multi-tenant service; all customers are competing for resources, and your queries will get queued when there aren't enough resources available.

If you want to know what affects the performance of your queries, you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the QueryExecution.Statistics documentation for what each property means). With this information you can figure out whether your slow queries are slow because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
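
As a minimal sketch of that diagnosis with boto3 (assuming default credentials; some statistics fields may be absent on older executions, hence the .get() calls):

```python
import boto3

athena = boto3.client("athena")

# List recent query execution IDs, most recent first (paginated via NextToken).
execution_ids = athena.list_query_executions(MaxResults=50)["QueryExecutionIds"]

for query_id in execution_ids:
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    stats = execution["QueryExecution"]["Statistics"]
    # QueryQueueTimeInMillis is time spent waiting for capacity, and
    # EngineExecutionTimeInMillis is the actual query work; a large gap
    # between their sum and TotalExecutionTimeInMillis points at planning
    # and service overhead rather than the query itself.
    print(
        query_id,
        "queued:", stats.get("QueryQueueTimeInMillis"),
        "engine:", stats.get("EngineExecutionTimeInMillis"),
        "total:", stats.get("TotalExecutionTimeInMillis"),
    )
```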

There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub-second latencies:

  • If you query a lot of data, use file formats that are optimized for that kind of thing; Parquet is almost always the answer. Also make sure your file sizes are optimal, around 100 MB.
  • Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
  • Avoid running queries at the top of the hour. This is when everyone else's scheduled jobs run, and there's significant contention for resources during the first minutes of every hour.
  • Skip GetQueryResults and download the CSV from S3 directly (see the sketch after this list). The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know them, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV; it's undocumented Protobuf data (see here and here for more information).
  • Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support; I don't really know the politics of this, and you need to start by talking to your account manager.
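
For the "skip GetQueryResults" tip above, here is a minimal sketch assuming boto3; the fetch_result_csv helper and the bucket name in the example URI are illustrative, and the OutputLocation comes from a GetQueryExecution response:

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

def fetch_result_csv(output_location):
    """Read an Athena result CSV straight from S3, skipping GetQueryResults.

    output_location is the S3 URI from GetQueryExecution, e.g.
    "s3://my-athena-results/<QueryExecutionId>.csv".
    """
    bucket, key = output_location[len("s3://"):].split("/", 1)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # Athena writes a plain, quoted CSV with a header row.
    return list(csv.reader(io.StringIO(body.decode("utf-8"))))

# Usage, given a prior GetQueryExecution response:
# rows = fetch_result_csv(
#     execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
# )
```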
