I am evaluating Google Pub/Sub vs Kafka. What are the differences? [closed]

In addition to Google Pub/Sub being managed by Google and Kafka being open source, the other difference is that Google Pub/Sub is a message queue (like RabbitMQ), whereas Kafka is more of a streaming log. Originally you couldn't "re-read" or "replay" messages with Pub/Sub. (EDIT: as of February 2019, you CAN replay messages and seek backwards in time to a certain timestamp, per the comment below.)
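To make the "seek backwards to a timestamp" idea concrete, here is a minimal in-memory sketch. ToyReplayableSubscription and its methods are hypothetical names for illustration, not the Pub/Sub client API, and the sketch assumes the subscription keeps acknowledged messages around so they can be replayed.

```python
class ToyReplayableSubscription:
    """Toy model of a subscription that retains acked messages (hypothetical)."""

    def __init__(self):
        self.messages = []   # (publish_time, data), in publish order
        self.acked = set()   # indexes of acknowledged messages

    def publish(self, publish_time, data):
        self.messages.append((publish_time, data))

    def pull(self):
        # Only unacknowledged messages are delivered.
        return [d for i, (_, d) in enumerate(self.messages) if i not in self.acked]

    def ack_all(self):
        self.acked = set(range(len(self.messages)))

    def seek(self, timestamp):
        # Everything published before the timestamp counts as acked,
        # everything at or after it becomes unacked again.
        self.acked = {i for i, (t, _) in enumerate(self.messages) if t < timestamp}


sub = ToyReplayableSubscription()
sub.publish(10, "a")
sub.publish(20, "b")
sub.publish(30, "c")
sub.ack_all()
after_ack = sub.pull()
sub.seek(20)
after_seek = sub.pull()
print(after_ack, after_seek)
```

The point of the sketch is only that replay rewinds the acknowledgement state; it does not republish anything.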

With Google Pub/Sub, once a message is read out of a subscription and ACKed, it's gone. In order to have more copies of a message to be read by different readers, you "fan-out" the topic by creating "subscriptions" for that topic, where each subscription will have an entire copy of everything that goes into the topic. But this also increases cost because Google charges Pub/Sub usage by the amount of data read out of it.
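A rough sketch of the fan-out behaviour described above, as a toy in-memory model. ToyTopic and the subscription names are made up for illustration; this is not the real Pub/Sub client, just the semantics of "one copy per subscription, gone once ACKed".

```python
from collections import OrderedDict


class ToyTopic:
    """In-memory stand-in for a Pub/Sub topic (hypothetical, not the real client)."""

    def __init__(self):
        self.subscriptions = {}   # name -> OrderedDict of msg_id -> data
        self._next_id = 0

    def create_subscription(self, name):
        # A subscription only sees messages published after it was created.
        self.subscriptions[name] = OrderedDict()

    def publish(self, data):
        msg_id = self._next_id
        self._next_id += 1
        # Fan-out: every subscription gets its own copy of the message.
        for backlog in self.subscriptions.values():
            backlog[msg_id] = data
        return msg_id

    def pull(self, name):
        # Return the unacked messages without removing them.
        return list(self.subscriptions[name].items())

    def ack(self, name, msg_id):
        # Acking removes the message from this subscription only.
        del self.subscriptions[name][msg_id]


topic = ToyTopic()
topic.create_subscription("billing")
topic.create_subscription("analytics")
first = topic.publish("order-1")
topic.publish("order-2")

topic.ack("billing", first)
billing_backlog = topic.pull("billing")
analytics_backlog = topic.pull("analytics")
print(billing_backlog)
print(analytics_backlog)
```

Here "billing" has lost order-1 after the ack, while "analytics" still holds both messages, which is also why every extra subscription adds to the data read out of the topic.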

With Kafka, you set a retention period (7 days by default, i.e. log.retention.hours=168) and the messages stay in Kafka regardless of how many consumers read them. You can add a new consumer (aka subscriber) and have it start consuming from the beginning of the topic any time you want. You can also set the retention period to be infinite, and then you can basically use Kafka as an immutable datastore, as described here: http://stackoverflow.com/a/22597637/304262
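The same idea as a small sketch: a toy log where reading never deletes anything and only retention does. ToyLog and its method names are hypothetical and model a single partition in memory; timestamps are plain seconds passed in by the caller.

```python
class ToyLog:
    """In-memory stand-in for one Kafka partition (hypothetical)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []      # (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now):
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now):
        # Retention is the only thing that deletes records; reads never do.
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]

    def earliest_offset(self):
        return self.records[0][0] if self.records else self.next_offset

    def read_from(self, offset):
        return [(o, v) for (o, _, v) in self.records if o >= offset]


WEEK = 7 * 24 * 3600
log = ToyLog(retention_seconds=WEEK)
log.append("a", now=0)
log.append("b", now=100)
log.append("c", now=200)

old_consumer = log.read_from(0)
# A consumer added later can still start from the beginning of the log.
new_consumer = log.read_from(log.earliest_offset())

log.expire(now=WEEK + 50)   # "a" is now older than the retention period
after_expiry = log.read_from(0)
print(old_consumer, new_consumer, after_expiry)
```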

Amazon Kinesis is the closest AWS analogue to Kafka (a managed streaming log, though it is not Kafka itself), whereas I think of Google Pub/Sub as a managed version of RabbitMQ. Amazon SNS with SQS is also similar to Google Pub/Sub (SNS provides the fan-out and SQS provides the queueing).

I have been reading the answers above and I would like to complement them, because I think there are some details pending:

Fully managed system - Both systems are available as fully managed services. Google provides Pub/Sub, and there are fully managed Kafka offerings that you can set up in the cloud and on-prem.

Cloud vs on-prem - I think this is a real difference between them: Pub/Sub is only offered as part of the GCP ecosystem, whereas Apache Kafka can be used both as a cloud service and on-prem (doing the cluster configuration yourself).

Message duplication - With Kafka, consumers track their position with offsets. Older clients stored them in Apache ZooKeeper; modern clients commit them to an internal Kafka topic (__consumer_offsets) by default, and you can also manage them yourself in external storage. Either way, the offsets record which messages the consumers have read so far. Pub/Sub works by acknowledging messages: if your code doesn't acknowledge a message before the deadline, the message is sent again. That protects against lost messages but means the same message can arrive more than once, so your consumer should tolerate duplicates; another way to avoid them is to use Cloud Dataflow's PubsubIO.
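A minimal sketch of the acknowledgement deadline and why duplicates show up, using a toy in-memory subscription. ToySubscription, handle and the message ID "m1" are hypothetical; the dedupe-by-ID set is one common way to make a consumer tolerate redelivery, not something Pub/Sub does for you in this model.

```python
class ToySubscription:
    """Toy model of ack-deadline redelivery (hypothetical, not the real client)."""

    def __init__(self, ack_deadline):
        self.ack_deadline = ack_deadline
        self.outstanding = {}   # msg_id -> (data, time of last delivery or None)
        self.deliveries = 0

    def publish(self, msg_id, data):
        self.outstanding[msg_id] = (data, None)

    def pull(self, now):
        delivered = []
        for msg_id, (data, sent_at) in list(self.outstanding.items()):
            # Deliver if never sent, or if the ack deadline has passed.
            if sent_at is None or now - sent_at >= self.ack_deadline:
                self.outstanding[msg_id] = (data, now)
                self.deliveries += 1
                delivered.append((msg_id, data))
        return delivered

    def ack(self, msg_id):
        self.outstanding.pop(msg_id, None)


processed = []
seen_ids = set()


def handle(sub, now, ack=True):
    for msg_id, data in sub.pull(now):
        if msg_id not in seen_ids:      # idempotent consumer: skip repeats
            seen_ids.add(msg_id)
            processed.append(data)
        if ack:
            sub.ack(msg_id)


sub = ToySubscription(ack_deadline=10)
sub.publish("m1", "charge card")
handle(sub, now=0, ack=False)   # processed, but "crashed" before acking
handle(sub, now=5)              # still inside the deadline: nothing delivered
handle(sub, now=12)             # deadline passed: redelivered, deduped, acked
print(sub.deliveries, processed)
```

The message is delivered twice but processed once, which is the behaviour you want from a consumer on an at-least-once system.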

Retention policy - Both Kafka and Pub/Sub have options to configure the maximum retention time; by default, I think it is 7 days.

Consumer groups vs subscriptions - Be careful how you read messages in both systems. Pub/Sub uses subscriptions: you create a subscription and then you start reading messages from it. Once a message is read and acknowledged, it is gone for that subscription. Kafka uses the concepts of "consumer group" and "partition": every consumer process belongs to a group, and once a message has been read from a specific partition, no other consumer process in the same consumer group will read that message (because the group's offset for that partition has moved past it). You can see the offset as a pointer that tells the processes which message to read next.
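The consumer group behaviour can be sketched with a toy model. assign_partitions and ToyGroup are hypothetical names; the round-robin assignment is just one simple strategy chosen for the illustration, and the "log" is a plain dict of lists.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


class ToyGroup:
    """Tracks one offset per partition for a whole consumer group (hypothetical)."""

    def __init__(self, partitions):
        self.offsets = {p: 0 for p in partitions}

    def poll(self, log, partition):
        # Read everything past the group's offset, then advance the offset.
        records = log[partition][self.offsets[partition]:]
        self.offsets[partition] += len(records)
        return records


log = {0: ["a", "c"], 1: ["b", "d"]}   # partition -> records
assignment = assign_partitions([0, 1], ["worker-1", "worker-2"])

group_a = ToyGroup([0, 1])
group_b = ToyGroup([0, 1])

first_read = group_a.poll(log, 0)    # group A reads partition 0
second_read = group_a.poll(log, 0)   # same group: offset already advanced
other_group = group_b.poll(log, 0)   # a different group has its own offsets
print(assignment, first_read, second_read, other_group)
```

Within one group the second poll returns nothing, while a different group still sees every record, which is the Kafka counterpart of creating a second subscription in Pub/Sub.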

I don't think there is a single correct answer to your question; it really depends on what you need and the constraints you have (below are some example scenarios):

If the solution must be in GCP, obviously use Google Cloud Pub/Sub. You will avoid the setup effort that Kafka requires, or the extra cost of a fully managed Kafka service.
If the solution needs to process data in a streaming way but also needs to support batch processing (eventually), it is a good idea to use Cloud Dataflow + Pub/Sub.
If the solution requires some Spark processing, you could explore Spark Streaming (which you can configure with Kafka for the stream processing).

In general, both are very solid stream processing systems. The point that makes the big difference is that Pub/Sub is a cloud service tied to GCP, whereas Apache Kafka can be used both in the cloud and on-prem.

Update (April 6th 2021):

Finally Kafka without Zookeeper

One big difference between Kafka and Cloud Pub/Sub is that Cloud Pub/Sub is fully managed for you. You don't have to worry about machines, setting up clusters, fine-tuning parameters, etc., which means that a lot of DevOps work is handled for you. This is important, especially when you need to scale.
