Serverless - DynamoDB (terrible) performance compared to RethinkDB + AWS Lambda


Question

In the process of migrating an existing Node.js (Hapi.js) + RethinkDB stack from an OVH VPS (the smallest VPS) to AWS Lambda (Node) + DynamoDB, I've recently come across a huge performance issue.

The usage is rather simple: people use an online tool, and "stuff" gets saved in the DB, passing through a Node.js server/Lambda. That "stuff" takes some space, around 3 KB non-gzipped (a complex object with lots of keys and children, which is why using a NoSQL solution makes sense).

There is no issue with the saving itself (for now...): not many people use the tool and there isn't much simultaneous writing to do, which makes a Lambda a better fit than a VPS running 24/7.


The real issue is when I want to download those results.

  • Using Node + RethinkDB, it takes about 3 sec to scan the whole table and generate a CSV file to download.
  • AWS Lambda + DynamoDB times out after 30 sec. Even if I paginate the results to download only 1,000 items, it still takes 20 sec (no timeout this time, just very slow) -> there are 2,200 items in that table, so we can deduce that we'd need around 45 sec to download the whole table if AWS Lambda didn't time out after 30 sec.

So, the operation takes around 3 sec with RethinkDB and would theoretically take around 45 sec with DynamoDB, for the same amount of fetched data.

Let's look at that data now. There are 2,200 items in the table, for a total of 5 MB; here are the DynamoDB stats:

Provisioned read capacity units 29 (Auto Scaling Enabled)
Provisioned write capacity units    25 (Auto Scaling Enabled)
Last decrease time  October 24, 2018 at 4:34:34 AM UTC+2
UTC: October 24, 2018 at 2:34:34 AM UTC

Local: October 24, 2018 at 4:34:34 AM UTC+2

Region (Ireland): October 24, 2018 at 2:34:34 AM UTC

Last increase time  October 24, 2018 at 12:22:07 PM UTC+2
UTC: October 24, 2018 at 10:22:07 AM UTC

Local: October 24, 2018 at 12:22:07 PM UTC+2

Region (Ireland): October 24, 2018 at 10:22:07 AM UTC

Storage size (in bytes) 5.05 MB
Item count  2,195

There are 5 provisioned read/write capacity units, with an autoscaling maximum of 300. But autoscaling doesn't seem to scale as I'd expect: it went from 5 to 29, and it could use 300, which would be enough to download 5 MB within 30 sec, but it doesn't use them (I'm just getting started with autoscaling, so I guess it's misconfigured?).

Here we can see the effect of the autoscaling, which does increase the number of read capacity units, but it does so too late, after the timeout has already happened. I've tried to download the data several times in a row and didn't really see much improvement, even with 29 units.

The Lambda itself is configured with 128 MB of RAM; increasing it to 1024 MB has no effect (as I expected, which confirms that the issue comes from the DynamoDB scan duration).


So, all this makes me wonder why DynamoDB can't do in 30 sec what RethinkDB does in 3 sec. It's not related to any kind of indexing, since the operation is a "scan" and therefore must go through all items in the DB, in any order.

I wonder how I am supposed to fetch that HUGE dataset (5 MB!) with DynamoDB to generate a CSV.
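For context, the export boils down to a paginated Scan that follows LastEvaluatedKey until the table is exhausted. This is only a minimal sketch of that approach (not my exact code), assuming the AWS SDK v2 for Node.js and a hypothetical table name:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' });

async function exportTableToCsv(tableName) {
  const items = [];
  let lastKey;

  do {
    // Each Scan call returns at most 1 MB of data; LastEvaluatedKey tells us
    // where to resume on the next call, and is undefined once we're done.
    const page = await docClient.scan({
      TableName: tableName,
      ExclusiveStartKey: lastKey,
    }).promise();

    items.push(...page.Items);
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);

  // Naive CSV conversion, for illustration only.
  const header = Object.keys(items[0] || {}).join(',');
  const rows = items.map(item =>
    Object.values(item).map(value => JSON.stringify(value)).join(','));
  return [header, ...rows].join('\n');
}

// Usage (hypothetical table name):
// exportTableToCsv('tool-results').then(csv => console.log(csv.length));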

And I really wonder if DynamoDB is the right tool for the job; I really wasn't expecting such low performance compared to what I've been using in the past (Mongo, Rethink, Postgres, etc.).

I guess it all comes down to proper configuration (and there probably are many things to improve there), but even so, why is it such a pain to download a bunch of data? 5 MB is not a big deal, but it feels like it requires a lot of effort and attention, while it's just a common operation to export a single table (stats, dump for backup, etc.).


Edit: Since I created this question, I have read https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b, which explains in depth the issue I've run into. Basically, autoscaling is slow to trigger, which explains why it doesn't scale right for my use case. This article is a must-read if you want to understand how DynamoDB auto-scaling works.

Solution

I have come across exactly the same problem in my application (i.e. DynamoDB autoscaling does not kick in fast enough for an on-demand, high-intensity job).

I was pretty committed to DynamoDB by the time I came across the problem, so I worked around it. Here is what I did.

When I'm about to start a high-intensity job, I programmatically increase the RCU and WCU on my DynamoDB table. In your case you could probably have one Lambda increase the throughput, then have that Lambda kick off another one to do the high-intensity job. Note that increasing the provisioned throughput can take a few seconds, hence splitting this into a separate Lambda is probably a good idea.
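As an illustration only (a sketch, not my production code - the table name, worker function name and capacity values are hypothetical, and it assumes the AWS SDK v2 for Node.js), the "throughput bump" Lambda could look something like this:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'eu-west-1' });
const lambda = new AWS.Lambda({ region: 'eu-west-1' });

exports.handler = async () => {
  // Raise the provisioned throughput before the heavy job starts.
  await dynamodb.updateTable({
    TableName: 'DEV_Invoices',
    ProvisionedThroughput: {
      ReadCapacityUnits: 50,
      WriteCapacityUnits: 200,
    },
  }).promise();

  // Hand the heavy work off to a second Lambda asynchronously, so this one
  // only pays for the few seconds the capacity update takes to be accepted.
  await lambda.invoke({
    FunctionName: 'invoice-extract-worker', // hypothetical worker function
    InvocationType: 'Event',                // fire and forget
    Payload: JSON.stringify({ trigger: 'extract' }),
  }).promise();
};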

I will paste my personal notes on the problem I faced below. Apologies but I can't be bothered to format them into stackoverflow markup.


We want enough throughput provisioned all the time so that users have a fast experience, and even more importantly, don't get any failed operations. However, we only want to provision enough throughput to serve our needs, as it costs us money.

For the most part we can use Autoscaling on our tables, which should adapt our provisioned throughput to the amount actually being consumed (i.e. more users = more throughput automatically provisioned). This fails in two key aspects for us:

  • Autoscaling only increases throughput about 10 minutes after the throughput provision threshold is breached, and when it does start scaling up, it is not very aggressive in doing so. There is a great blog on this here: https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b
  • When there is literally zero consumption of throughput, DynamoDB does not decrease throughput (see "AWS Dynamo not auto-scaling back down").

The place we really need to manage throughput is the Invoice table WCUs. RCUs are a lot cheaper than WCUs, so reads are less of a worry to provision; for most tables, provisioning a few RCUs and WCUs should be plenty. However, when we do an extract from the source, our write capacity on the Invoices table is high for a 30-minute period.

Let's imagine we just relied on Autoscaling. When a user kicked off an extract, we would have 5 minutes of burst capacity, which may or may not be enough throughput. Autoscaling would kick in after around 10 minutes (at best), but it would do so ponderously - not scaling up as fast as we needed. Our provision would not be high enough, we would get throttled, and we would fail to get the data we wanted. If several processes were running concurrently, this problem would be even worse - we just couldn't handle multiple extracts at the same time.

Fortunately, we know when we are about to beast the Invoices table, so we can programmatically increase throughput on it. Increasing throughput programmatically seems to take effect very quickly, probably within seconds. I noticed in testing that the Metrics view in DynamoDB is pretty useless: it's really slow to update, and I think it sometimes just showed the wrong information. You can use the AWS CLI to describe the table and see what throughput is provisioned in real time:

aws dynamodb describe-table --table-name DEV_Invoices
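(The same command accepts a JMESPath filter if you only care about the capacity numbers, e.g. adding --query 'Table.ProvisionedThroughput' prints just the provisioned units, the last increase/decrease times and the number of decreases used today.)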

In theory we could just increase throughput when an extract started, and then reduce it again when we finished. However, whilst you can increase throughput provision as often as you like, you can only decrease it 4 times in a day, although after that you can decrease it once every hour (i.e. up to 27 times in 24 hours): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#default-limits-throughput. This approach is not going to work, as our decrease in provision might well fail.

Even if Autoscaling is in play, it still has to abide by the provisioning decrease rules. So if we've decreased 4 times, Autoscaling will have to wait an hour before decreasing again - and that's for both read and write values.

Increasing throughput provision programmatically is a good idea: we can do it fast (much faster than Autoscaling), so it works for our infrequent high workloads. We can't decrease throughput programmatically after an extract (see above), but there are a couple of other options.

Autoscaling for throughput decrease

Note that even when Autoscaling is set, we can programmatically change the provision to anything we like (e.g. higher than the maximum Autoscaling level).

We can just rely on Autoscaling to bring the capacity back down an hour or two after the extract has finished; that's not going to cost us too much.

There is another problem though. If our consumed capacity drops right down to zero after an extract, which is likely, no consumption data is sent to CloudWatch and Autoscaling doesn't do anything to reduce provisioned capacity, leaving us stuck on a high capacity.

There are two fudge options to fix this though. Firstly, we can set the minimum and maximum throughput provision to the same value. So, for example, setting the minimum and maximum provisioned RCUs within Autoscaling to 20 will ensure that the provisioned capacity returns to 20, even if there is zero consumed capacity. I'm not sure why, but this works (I've tested it, and it does); AWS acknowledges the workaround here:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
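As a sketch only (hypothetical table name and capacity values, assuming the AWS SDK v2 for Node.js), pinning the minimum and maximum like this can be done through the Application Auto Scaling API:

const AWS = require('aws-sdk');
const autoscaling = new AWS.ApplicationAutoScaling({ region: 'eu-west-1' });

// Pin the Autoscaling min and max to the same value so that provisioned
// capacity always returns there, even when consumed capacity drops to zero.
async function pinReadCapacity(tableName, units) {
  await autoscaling.registerScalableTarget({
    ServiceNamespace: 'dynamodb',
    ResourceId: `table/${tableName}`,
    ScalableDimension: 'dynamodb:table:ReadCapacityUnits',
    MinCapacity: units,
    MaxCapacity: units,
  }).promise();
}

// pinReadCapacity('DEV_Invoices', 20);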

The other option is to create a Lambda function that attempts a (failed) read and delete operation on the table every minute. Failed operations still consume capacity, which is why this works. This job ensures data is sent to CloudWatch regularly, even when our 'real' consumption is zero, and therefore Autoscaling will reduce capacity correctly.

Note that read and write data is sent separately to CloudWatch. So if we want WCUs to decrease when the real consumed WCUs are zero, we need to use a write operation (i.e. a delete). Similarly, we need a read operation to make sure RCUs are updated. Note that reads and deletes still consume throughput even if the item does not exist.
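For illustration (hypothetical table and key names, assuming the AWS SDK v2 for Node.js), the "keep-alive" Lambda could be as small as this, triggered by a one-minute CloudWatch Events / EventBridge schedule:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'eu-west-1' });

exports.handler = async () => {
  const params = {
    TableName: 'DEV_Invoices',
    Key: { InvoiceId: { S: 'KEEPALIVE_DOES_NOT_EXIST' } }, // hypothetical key schema
  };

  // Both calls target an item that does not exist, so they return/delete
  // nothing, but they still consume a read and a write capacity unit,
  // which keeps consumption metrics flowing to CloudWatch so that
  // Autoscaling can scale the table back down.
  await dynamodb.getItem(params).promise();
  await dynamodb.deleteItem(params).promise();
};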

Lambda for throughput decrease

In the previous solution we used a Lambda function to continuously 'poll' the table, thus creating the CloudWatch data which enables DynamoDB Autoscaling to function. As an alternative, we could just have a Lambda which runs regularly and scales down the throughput when required. When you 'describe' a DynamoDB table, you get the current provisioned throughput as well as the last increase datetime and last decrease datetime. So the Lambda could say: if the provisioned WCUs are over a threshold and the last time we had a throughput increase was more than half an hour ago (i.e. we are not in the middle of an extract), let's decrease the throughput right down.
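A rough sketch of that scheduled Lambda (hypothetical thresholds and table name, assuming the AWS SDK v2 for Node.js) might look like this:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'eu-west-1' });

const WCU_THRESHOLD = 25;            // above this we consider the table "scaled up"
const IDLE_WCU = 5;                  // capacity to fall back to
const HALF_HOUR_MS = 30 * 60 * 1000;

exports.handler = async () => {
  const { Table } = await dynamodb.describeTable({ TableName: 'DEV_Invoices' }).promise();
  const throughput = Table.ProvisionedThroughput;

  const increasedRecently = throughput.LastIncreaseDateTime &&
    Date.now() - new Date(throughput.LastIncreaseDateTime).getTime() < HALF_HOUR_MS;

  // Only scale down when we are clearly over the idle level and no extract
  // has bumped the throughput in the last half hour.
  if (throughput.WriteCapacityUnits > WCU_THRESHOLD && !increasedRecently) {
    await dynamodb.updateTable({
      TableName: 'DEV_Invoices',
      ProvisionedThroughput: {
        ReadCapacityUnits: throughput.ReadCapacityUnits, // leave reads unchanged
        WriteCapacityUnits: IDLE_WCU,
      },
    }).promise();
  }
};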

Given that this is more code than the Autoscaling option, I'm not inclined to do this one.
