How to prevent duplicate SQS Messages?


Question

What is the best way to prevent duplicate messages in Amazon SQS? I have an SQS queue of domains waiting to be crawled. Before I add a new domain to the queue, I can check it against the saved data to see whether it has been crawled recently, which prevents duplicates.

The problem is with the domains that have not been crawled yet. For example, if there are 1,000 uncrawled domains in the queue, any of those links could be added again, and again, and again, which swells my SQS queue to hundreds of thousands of messages that are mostly duplicates.

How do I prevent this? Is there a way to remove all duplicates from a queue? Or is there a way to search a queue for a message before I add it? I feel this is a problem that anyone using SQS must have run into.

One option I can see is to store some data before the domain is added to the queue. But if I have to store the data twice, that rather defeats the point of using SQS in the first place.

Answer

As the other answers mentioned, you can't prevent duplicate messages coming through from SQS.

Most of the time your messages will be handed to one of your consumers once, but you will run into duplicates at some stage.

I don't think there is an easy answer to this question, because it entails coming up with a proper architecture that can cope with duplicates, meaning it's idempotent in nature.

If all the workers in your distributed architecture were idempotent, this would be easy, because you wouldn't need to worry about duplicates: processing the same message twice would leave the system in the same state as processing it once (think "set crawled = true for domain X" rather than "increment the crawl count for domain X"). But in reality that sort of environment does not exist; somewhere along the way something will not be able to handle it.

I am currently working on a project where I am required to solve this and come up with an approach to handle it. I thought it might benefit others to share my thinking here, and this might be a good place to get some feedback on it.

A fact store

It's a pretty good idea to develop services so that they collect facts which can, in theory, be replayed to reproduce the same state in all the affected downstream systems.

For example, let's say you are building a message broker for a stock trading platform. (I have actually worked on a project like this before; it was horrible, but also a good learning experience.)

Now let's say that trades come in, and there are 3 systems interested in them:

  1. A legacy mainframe that needs to be updated
  2. A system that collates all the trades and shares them with partners on an FTP server
  3. A service that records the trade and reallocates the shares to the new owner

It's a bit convoluted, I know, but the idea is that one message (fact) coming in has various distributed downstream effects.

Now let's imagine that we maintain a fact store, a recording of all the trades coming into our broker, and that all 3 downstream service owners call us to tell us they have lost all of their data from the last 3 days. The FTP download is 3 days behind, the mainframe is 3 days behind, and all the trades are 3 days behind.

Because we have the fact store, we could theoretically replay all these messages from a certain time to a certain time; in our example, from 3 days ago until now. And the downstream services could be caught up.
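As a rough sketch of what such a replay could look like in Python, assuming the facts live in MongoDB with a received_at timestamp and the original message body stored alongside them (every name here is illustrative, not from the answer):

```python
import boto3
from pymongo import MongoClient

# Hypothetical names throughout; adjust to your own setup.
facts = MongoClient().broker.facts
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/trades"

def replay(start, end):
    """Re-publish every fact recorded in [start, end), oldest first,
    so the downstream systems can rebuild the state they lost."""
    query = {"received_at": {"$gte": start, "$lt": end}}
    for fact in facts.find(query).sort("received_at", 1):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=fact["body"])
```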

This example might seem a bit over the top, but I'm trying to convey something very particular: the facts are the important things to keep track of, because they are what we are going to use in our architecture to battle duplicates.

How a fact store helps us handle duplicate messages

Provided you implement your fact store on a persistence tier that gives you the C and A of the CAP theorem (consistency and availability), you can do the following:

As soon as a message is received from the queue, you check your fact store to see whether you've already seen this message before, and if you have, whether it's currently locked and in a pending state. In my case I will be using MongoDB to implement my fact store, as I am very comfortable with it, but various other DB technologies should be able to handle this.

If the fact does not exist yet, it gets inserted into the fact store with a pending state and a lock expiration time. This should be done using an atomic operation, because you do not want this to happen twice! This is where you ensure your service's idempotence.
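A minimal sketch of that check-and-insert with pymongo, covering both this paragraph and the previous one (the collection name, field names, and lock duration are my assumptions). The unique index makes the insert itself the arbiter: if two workers race on the same fact, exactly one insert succeeds.

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

facts = MongoClient().crawler.facts  # hypothetical database/collection
# Unique index on the fact's natural key: two racing inserts cannot both succeed.
facts.create_index("domain", unique=True)

def try_acquire(domain, lock_seconds=300):
    """Record the fact in a 'pending' state with a lock expiry.
    Returns True if we now own the lock, False if the fact was already recorded."""
    now = datetime.now(timezone.utc)
    try:
        facts.insert_one({
            "domain": domain,
            "state": "pending",
            "received_at": now,
            "lock_expires_at": now + timedelta(seconds=lock_seconds),
        })
        return True
    except DuplicateKeyError:
        return False
```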

The happy case - happens most of the time

When the fact store comes back to your service telling it that the fact did not exist and that a lock was created, the service attempts to do its work. Once it's done, it deletes the SQS message and marks the fact as completed.
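Continuing the sketch above, the consumer side of the happy path might look like this (the queue URL is hypothetical, and crawl stands in for whatever work your service actually does):

```python
import boto3

sqs = boto3.client("sqs")  # assumes AWS credentials and region are configured
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # hypothetical

def handle(message):
    domain = message["Body"]
    if not try_acquire(domain):
        # Duplicate: the fact is already recorded (pending or completed).
        # Leave the message alone; whoever owns the lock will clean up.
        return
    crawl(domain)  # placeholder for the actual work
    # Work succeeded: delete the message so SQS stops redelivering it,
    # then mark the fact as completed in the store.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
    facts.update_one({"domain": domain}, {"$set": {"state": "completed"}})
```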

The duplicate message case

So that's what happens when a message comes through and it's not a duplicate. But let's look at what happens when a duplicate message comes in. The service picks it up and asks the fact store to record it with a lock. The fact store tells it that the fact already exists and is locked, so the service ignores the message and skips over it. Once the other worker is done processing the message, it will delete it from the queue, and we won't see it again.

The disaster case - doesn't happen often

So what happens when a service records the fact for the first time in the store, takes a lock for a certain period, and then falls over? Well, SQS will present the message to you again if it was picked up but not deleted within a certain period after it was served from the queue (the visibility timeout). That's why we code up our fact store such that a service maintains its lock only for a limited time: if the service falls over, we want SQS to present the message to the service, or another instance of it, at a later time, allowing that service to assume that the fact should be incorporated into state (executed) again.
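Continuing the same sketch, one way to handle that recovery is to let a worker atomically take over a lock that has expired before falling back to a fresh insert (again, field names are assumptions). Keeping the lock duration at or below the queue's visibility timeout means a redelivered message will find the stale lock already expired.

```python
def try_acquire_or_take_over(domain, lock_seconds=300):
    """Like try_acquire, but atomically steals the lock when a previous
    worker crashed mid-task and let its lock expire."""
    now = datetime.now(timezone.utc)
    stale = facts.find_one_and_update(
        {"domain": domain, "state": "pending", "lock_expires_at": {"$lt": now}},
        {"$set": {"lock_expires_at": now + timedelta(seconds=lock_seconds)}},
    )
    if stale is not None:
        return True  # the old lock had expired; we now own the fact
    return try_acquire(domain, lock_seconds)  # fall back to a fresh insert
```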
