Kafka or SNS or something else? [closed]
Sorry if this is a newbie question, but I'm trying to understand what I should use. As far as I understand, Kafka is:
Apache Kafka is a distributed publish-subscribe messaging system.
And SNS is also a pub/sub system.
My goal is to use a message queue system on AWS with an application that will be distributed over a few servers (by the way, the main language is Python). Because it is on Amazon, my first thought was to use SNS and SQS. But then I saw a lot of people using Kafka on AWS. What are the advantages of one over the other?
The use-cases for Kafka and Amazon SQS/Amazon SNS are quite different.
Kafka, as you wrote, is a distributed publish-subscribe system. It is designed for very high throughput, processing thousands of messages per second. Of course, you need to set it up and cluster it yourself. It supports multiple readers, which may "catch up" with the stream of messages at any point (well, as long as the messages are still on disk). You can use it both as a queue (using consumer groups) and as a topic.
An important characteristic is that you cannot selectively acknowledge messages as "processed"; the only option is acknowledging all messages up to a certain offset.
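To make the offset point concrete, here is a minimal sketch in plain Python. `PartitionLog` is a hypothetical toy class that only imitates the offset semantics of a single partition; it is not the Kafka client API, and the group name `"workers"` is made up for illustration.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition (hypothetical, not a Kafka API)."""

    def __init__(self):
        self.messages = []    # append-only; reading never removes anything
        self.committed = {}   # consumer group name -> next offset to read

    def append(self, value):
        self.messages.append(value)
        return len(self.messages) - 1   # offset of the new message

    def poll(self, group, max_records=10):
        # Return (offset, value) pairs starting at the group's committed position.
        start = self.committed.get(group, 0)
        chunk = self.messages[start:start + max_records]
        return list(enumerate(chunk, start=start))

    def commit(self, group, offset):
        # Committing offset N means "everything up to and including N is done".
        self.committed[group] = offset + 1


log = PartitionLog()
for value in ["a", "b", "c", "d"]:
    log.append(value)

batch = log.poll("workers")
# Suppose "a" and "c" were processed but "b" failed. There is no way to say
# "acknowledge offsets 0 and 2 only": committing offset 2 would also mark "b"
# as done. The safe choice is to commit only up to the last contiguous success.
log.commit("workers", 0)
redelivered = log.poll("workers")   # "b", "c" and "d" are read again
```

In this model "c" is read a second time even though it succeeded, which is why consumers built on offset commits usually need idempotent processing.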
SQS/SNS on the other hand:
- no setup/no maintenance
- either a queue (SQS) or a topic (SNS)
- various limitations (on size, how long a message lives, etc.)
- limited throughput: you can do batch and concurrent requests, but achieving high throughput would still be expensive
- I'm not sure if the messages are replicated; however, the at-least-once delivery guarantee in SQS would suggest so
- SNS has notifications for email, SMS, SQS, HTTP built-in. With Kafka, you would probably have to code it yourself
- no "message stream" concept
So overall I would say SQS/SNS are well suited for simpler tasks and workloads with a lower volume of messages.
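For contrast with Kafka's offset-based acknowledgement, here is a rough sketch of how a queue in the SQS style handles acknowledgement: each message is deleted individually, and a message that was received but not deleted becomes visible again after a timeout. `ToyQueue` is a hypothetical in-memory model written for illustration, not boto3 or the real service; time is passed in explicitly as `now` to keep the example deterministic.

```python
import itertools


class ToyQueue:
    """Toy model of an SQS-style queue (hypothetical class, not the AWS API)."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._ids = itertools.count()
        self._messages = {}          # message id -> body
        self._invisible_until = {}   # message id -> time it becomes visible again

    def send(self, body):
        mid = next(self._ids)
        self._messages[mid] = body
        self._invisible_until[mid] = 0
        return mid

    def receive(self, now, max_messages=10):
        # Hand out visible messages and hide them for the visibility timeout.
        out = []
        for mid, body in self._messages.items():
            if len(out) < max_messages and self._invisible_until[mid] <= now:
                self._invisible_until[mid] = now + self.visibility_timeout
                out.append((mid, body))
        return out

    def delete(self, mid):
        # Per-message acknowledgement: only this message is removed.
        self._messages.pop(mid, None)
        self._invisible_until.pop(mid, None)


queue = ToyQueue(visibility_timeout=30)
for body in ["a", "b", "c"]:
    queue.send(body)

first = queue.receive(now=0)
# "a" and "c" succeeded, "b" failed: acknowledge the two successes one by one.
queue.delete(first[0][0])
queue.delete(first[2][0])

hidden = queue.receive(now=10)   # "b" is still invisible, nothing is returned
retry = queue.receive(now=30)    # the timeout has passed, "b" is delivered again
```

The toy keeps insertion order for simplicity; a real standard SQS queue does not promise ordering.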
This is a classic trade-off:
AWS tools (SQS, SNS)
These will be easier for you to set up and integrate with the rest of your architecture, especially if most of it is already running on AWS. They will also probably be cheaper at first, since they have a good pay-as-you-go model, but the cost will not scale as well, so you have to think about that.
Apache Kafka
Here, you're using a highly popular (not just trendy) distributed (this is important if you think you will scale a lot) pub/sub model. Nowadays, this model seems to be much preferred, since running analytics on the data going through the pipes is very common, and usually with an SOA you can have a multitude of small services consuming the messages and doing their thing, without the data being removed from the queue. You also get a lot of configuration options, so depending on your use case you can fine-tune it to your needs. This means more work, but a more optimized service down the road.
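The "many small services, data not removed" idea can be sketched in a few lines of plain Python. This is only an illustration of the reading pattern, assuming a shared append-only list and a hypothetical `Reader` class that tracks its own position; the service names are made up and nothing here is a Kafka API.

```python
events = ["signup", "purchase", "refund"]   # shared, append-only stream


class Reader:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, name):
        self.name = name
        self.offset = 0

    def read(self, stream):
        # Return everything not yet seen by this reader; the stream is untouched.
        new = stream[self.offset:]
        self.offset = len(stream)
        return new


analytics = Reader("analytics")
billing = Reader("billing")

seen_by_analytics = analytics.read(events)   # the first three events
events.append("login")
seen_by_billing = billing.read(events)       # a late joiner still sees all four
analytics_again = analytics.read(events)     # only the new "login" event
```

Each reader advances independently, so adding a new service later does not require the producers or the other readers to change.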
Summary
This is a classic trade-off between speed and ease of development on one side and a more modular, customized solution on the other, which has more overhead for the first implementation but scales better.
Personal Advice
If you are prototyping something, favor speed of development, so AWS tools. If your requirements are frozen and require significant scale, definitely take the time to use Kafka. I am also a big believer in using-open-source-makes-the-world-better, but that's not the strongest argument for it.
The points mentioned above are really helpful. In addition to them:
- It's very difficult to do multi-tenancy with SQS/SNS; perhaps there is no way other than creating a separate queue for each tenant (very hard to maintain).
- Kafka is clusterable; a cluster connects to apps and DBs in real time and provides key/value access to data. The retention period for each message, distribution, and replication are bigger advantages, whereas SQS is more of a black box: the sender sends a message, and the receiver receives it, marks it as processed, and deletes it.