How to prevent duplicate SQS Messages?
What is the best way to prevent duplicate messages in Amazon SQS? I have a SQS of domains waiting to be crawled. before I add a new domain to the SQS I can check with the saved data to see if it has been crawled recently, to prevent duplicates.
The problem is with the domains that have not been crawled yet. For example if there is 1000 domains in the queue that have not been crawled. Any of those links could be added again, and again and again. Which swells my SQS to hundreds of thousands of messages that is mostly duplicates.
How do I prevent this? Is there a way to remove all duplicates from a queue? Or is there a way to search a queue for a message before I add it? I feel this is a problem that anyone with a SQS must have experienced.
One option that I can see is if I store some data before the domain is added to the SQS. But if I have to store the data twice, that kinda ruins the point of using a SQS in the first place.
As the other answers mentioned, you can't prevent duplicate messages coming through from SQS.
Most of the time your messages will be handed to one of your consumers once, but you will run into duplicates at some stage.
I don't think there is an easy answer to this question, because it entails coming up with a proper architecture that can cope with duplicates, meaning it's idempotent in nature.
If all the workers in your distributed architecture were idempotent, it would be easy, because you wouldn't need to worry about duplicates. But in reality, that sort of environment does not exist, somewhere along the way something will not be able to handle it.
I am currently working on a project where it's required of me to solve this, and come up with an approach to handle it. I thought it might benefit others to share my thinking here. And it might be a good place to get some feedback on my thinking.
Fact store
It's a pretty good idea to develop services so that they collect facts which can theoretically be replayed to reproduce the same state in all the affected downstream systems.
For example, let's say you are building a message broker for a stock trading platform. (I have actually worked on a project like this before, it was horrible, but also a good learning experience.)
Now let's say that that trades come in, and there are 3 systems interested in it:
- An old school mainframe which needs to stay updated
- A system that collates all the trades and share it with partners on a FTP server
- The service that records the trade, and reallocates shares to the new owner
It's a bit convoluted, I know, but the idea is that one message (fact) coming in, has various distributed downstream effects.
Now let's imagine that we maintain a fact store, a recording of all the trades coming into our broker. And that all 3 downstream service owners calls us to tell us that they have lost all of their data from the last 3 days. The FTP download is 3 days behind, the mainframe is 3 days behind, and all the trades are 3 days behind.
Because we have the fact store, we could theoretically replay all these messages from a certain time to a certain time. In our example that would be from 3 days ago until now. And the downstream services could be caught up.
This example might seem a bit over the top, but I'm trying to convey something very particular: the facts are the important things to keep track of, because that's where we are going to use in our architecture to battle duplicates.
How the Fact store helps us with duplicate messages
Provided you implement your fact store on a persistence tier that gives you the CA parts of the CAP theorem, consistency and availability, you can do the following:
As soon as a message is received from a queue, you check in your fact store whether you've already seen this message before, and if you have, whether it's locked at the moment, and in a pending state. In my case, I will be using MongoDB to implement my fact store, as I am very comfortable with it, but various other DB technologies should be able to handle this.
If the fact does not exist yet, it gets inserted into the fact store, with a pending state, and a lock expiration time. This should be done using atomic operations, because you do not want this to happen twice! This is where you ensure your service's idempotence.
Happy case - happens most of the time
When the Fact store comes back to your service telling it that the fact did not exist, and that a lock was created, the service attempts to do it's work. Once it's done, it deletes the SQS message, and marks the fact as completed.
Duplicate message
So that's what happens when a message comes through and it's not a duplicate. But let's look at when a duplicate message comes in. The service picks it up, and asks the fact store to record it with a lock. The fact store tells it that it already exists, and that it's locked. The service ignores the message and skips over it! Once the message processing is done, by the other worker, it will delete this message from the queue, and we won't see it again.
Disaster case - happens rarely
So what happens when a service records the fact for the first time in the store, then get a lock for a certain period, but falls over? Well SQS will present a message to you again, if it was picked up, but not deleted within a certain period after it was served from the queue. Thats why we code up our fact store such that a service maintains a lock for a limited time. Because if it falls over, we want SQS to present the message to the service, or another instance thereof at a later time, allowing that service to assume that the fact should be incorporated into state (executed) again.
Amazon SQS Introduces FIFO Queues with Exactly-Once Processing and Lower Prices for Standard Queues
Using the Amazon SQS Message Deduplication ID The message deduplication ID is the token used for deduplication of sent messages. If a message with a particular message deduplication ID is sent successfully, any messages sent with the same message deduplication ID are accepted successfully but aren't delivered during the 5-minute deduplication interval.
Amazon SQS Introduces FIFO Queues
Using the Amazon SQS Message Deduplication ID
According to AWS Docs, Exactly-Once Processing
provides a way to avoid duplicate messages.
Unlike standard queues, FIFO queues don't introduce duplicate messages. FIFO queues help you avoid sending duplicates to a queue. If you retry the SendMessage action within the 5-minute deduplication interval, Amazon SQS doesn't introduce any duplicates into the queue.
If your queue is an FIFO queue and has enabled content based duplication, this function can be utilized to avoid duplicate messages during the deduplication interval. For more, read this section, and the below link.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-sqs-queues.html#cfn-sqs-queue-contentbaseddeduplication
There is no API level way of preventing duplicate messages to be posted to a SQS queue. You would need to handle this at application level I am afraid.
You can use a DynamoDB table to store your Domain Names waiting to be crawled and only add them to the queue if they are not in DynamoDB for example.