SQS Cloudwatch Sanity


Problem Description

I'm analyzing a recent load event on my SQS consumer service and am stuck with some SQS Cloudwatch metrics that don't make sense to me. Essentially, it looks like the queue was getting overloaded with messages that aren't accounted for in the metrics. Let me start by summarizing the data over a selected 5-minute period:


  • ApproximateNumberOfMessagesVisible: 215,686 -> 233,605 (gain of 17,919 for this period)
  • ApproximateNumberOfMessagesNotVisible: 2,239 -> 2,129 (loss of 110 for this period)
  • NumberOfMessagesSent: 31,441
  • NumberOfMessagesDeleted: 24,665

What is baffling me is that the gain in ApproximateNumberOfMessagesVisible (+17k) is far larger than the number of messages left unprocessed (NumberOfMessagesSent - NumberOfMessagesDeleted ≈ 6k).
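To make the mismatch concrete, here is a minimal arithmetic sketch using the figures quoted above (the variable names are mine, purely for illustration):

```python
# Rough sanity check of the deltas over the 5-minute window (numbers from above).
visible_start, visible_end = 215_686, 233_605
sent, deleted = 31_441, 24_665

observed_gain = visible_end - visible_start            # +17,919 visible messages
expected_backlog_growth = sent - deleted               # ~6,776 if every receive led to a delete
unexplained = observed_gain - expected_backlog_growth  # ~11,143 messages unaccounted for

print(f"observed gain in visible:  {observed_gain:,}")
print(f"sent - deleted:            {expected_backlog_growth:,}")
print(f"unexplained visible gain:  {unexplained:,}")
```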

I've included metrics about the number of invisible messages as well (just in case a bunch of invisible messages suddenly became visible), but that doesn't seem to be the case.

How is this possible?

Answer

How do messages become visible?


  • By being sent to the queue.

  • By being returned to visible status: the message was received but not deleted before its visibility timeout expired, so it became visible again.

There isn't enough history provided here to conclusively state that SQS's counters are right or wrong, but consider this suggestion from an old comment of mine on Why do SQS Messages Sometimes Remain In-Flight on a Queue:


In Cloudwatch, select both the graph for NumberOfMessagesReceived and NumberOfMessagesDeleted. You should find that one graph perfectly overlays and completely masks the other; if to some extent they don't, it strongly suggests a problem in the library that you are using or in your consumers, which would cause the symptoms you observe.
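If it helps, here is one way to pull those two metrics side by side with boto3 instead of eyeballing the console graphs. This is only a sketch: the queue name and time window are placeholders, not values from the question.

```python
# Illustrative sketch: fetch the per-period Sum of NumberOfMessagesReceived and
# NumberOfMessagesDeleted for one queue and flag periods where they diverge.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)          # placeholder window

def metric_sums(metric_name):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],  # placeholder queue
        StartTime=start,
        EndTime=end,
        Period=300,                        # 5-minute buckets
        Statistics=["Sum"],
    )
    return {dp["Timestamp"]: dp["Sum"] for dp in resp["Datapoints"]}

received = metric_sums("NumberOfMessagesReceived")
deleted = metric_sums("NumberOfMessagesDeleted")

for ts in sorted(received):
    r, d = received[ts], deleted.get(ts, 0)
    flag = "" if r == d else "  <-- receives without deletes (possible redelivery)"
    print(f"{ts:%H:%M}  received={r:>8.0f}  deleted={d:>8.0f}{flag}")
```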

You can delete a single message from a queue only once, but you can receive a single message multiple times before that occurs, if you have a process that is dropping messages on the floor, accidentally or deliberately. They again become visible and SQS will redeliver them after the visibility timeout expires. If this is happening, the two metrics mentioned above will not line up perfectly over time.

Otherwise, they should -- as should the stats you are seeing.

So, you're right, it doesn't make sense, if your workers are all behaving perfectly and processing and deleting each message on the first attempt.
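For reference, a worker that behaves that way looks roughly like the sketch below (boto3; the queue URL and the process() handler are placeholders, not anything from the question). The key point is that delete_message is called only after processing succeeds, so a failed message is simply left to reappear after the visibility timeout.

```python
# Minimal consumer sketch: delete each message only after it has been processed.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def process(body):
    ...  # your business logic; raise on failure so the message is not deleted

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])
        except Exception:
            # Don't delete: the message becomes visible again after the
            # visibility timeout and is redelivered (which inflates Received).
            continue
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```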

Note that if you use the AWS console to inspect messages, the two counters I mentioned will not line up cleanly, because the console receives messages and then resets their visibility timeout, just like a normal consumer might, so this will artificially inflate the receive counters compared to the delete counters.
