Using Kafka as a (CQRS) Eventstore. Good idea?
Although I've come across Kafka before, I only recently realized that Kafka could perhaps be used as (the basis of) a CQRS event store.
Some of the main features Kafka offers:
- Event capturing / storing, all HA of course.
- Pub / sub architecture
- Ability to replay the event log, which allows new subscribers to register with the system after the fact.
Admittedly I'm not 100% versed in CQRS / event sourcing, but this seems pretty close to what an event store should be. The funny thing is: I really can't find much about Kafka being used as an event store, so perhaps I am missing something.
So, is anything missing from Kafka for it to be a good event store? Would it work? Has anyone used it in production? Interested in insight, links, etc.
Basically the state of the system is saved based on the transactions/events the system has ever received, instead of just saving the current state / snapshot of the system which is what is usually done. (Think of it as a General Ledger in Accounting: all transactions ultimately add up to the final state) This allows all kinds of cool things, but just read up on the links provided.
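The general-ledger analogy above can be sketched in a few lines: state is never stored directly, it is derived by folding the full event history. All names here are illustrative, not from any particular library.

```python
# Minimal event sourcing sketch: current state = fold over all events.
# Event shape and names are illustrative only.

def apply(balance, event):
    """Apply a single ledger event to the running balance."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdrawal":
        return balance - amount
    raise ValueError(f"unknown event kind: {kind}")

def replay(events, initial=0):
    """Reconstitute current state from the complete event log."""
    state = initial
    for event in events:
        state = apply(state, event)
    return state

log = [("deposit", 100), ("withdrawal", 30), ("deposit", 5)]
print(replay(log))  # 75
```

The log is the source of truth; the balance is just a derived value that can always be recomputed, which is what enables the "replay for new subscribers" property mentioned above.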
I am one of the original authors of Kafka. Kafka will work very well as a log for event sourcing. It is fault-tolerant, scales to enormous data sizes, and has a built in partitioning model.
We use it for several use cases of this form at LinkedIn. For example our open source stream processing system, Apache Samza, comes with built-in support for event sourcing.
I think you don't hear much about using Kafka for event sourcing primarily because the event sourcing terminology doesn't seem to be very prevalent in the consumer web space where Kafka is most popular.
I have written a bit about this style of Kafka usage here.
Kafka is meant to be a messaging system, which has many similarities to an event store; however, to quote their intro:
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the retention is set for two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
So while messages can potentially be retained indefinitely, the expectation is that they will be deleted. This doesn't mean you can't use this as an event store, but it may be better to use something else. Take a look at EventStore for an alternative.
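For reference, retention is a per-topic setting, so a topic can be configured to keep records indefinitely. A sketch of the relevant topic-level overrides (key names are from Kafka's topic configuration documentation):

```properties
# Topic-level overrides to keep records forever (sketch)
retention.ms=-1        # -1 disables time-based deletion
retention.bytes=-1     # -1 disables size-based deletion
cleanup.policy=delete  # keep every record; "compact" would keep only the latest per key
```

With these settings, records are never deleted for age or size; choosing `compact` instead keeps only the latest record per key, which suits changelog-style topics rather than full event histories.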
UPDATE
Kafka documentation:
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
UPDATE 2
One concern with using Kafka for event sourcing is the number of required topics. Typically in event sourcing, there is a stream (topic) of events per entity (such as user, product, etc). This way, the current state of an entity can be reconstituted by re-applying all events in the stream. Each Kafka topic consists of one or more partitions and each partition is stored as a directory on the file system. There will also be pressure from ZooKeeper as the number of znodes increases.
I keep coming back to this Q&A, and I did not find the existing answers nuanced enough, so I am adding this one.
TL;DR. Yes or No, depending on your event sourcing usage.
There are two primary kinds of event sourced systems of which I am aware.
Downstream event processors = Yes
In this kind of system, events happen in the real world and are recorded as facts - for example, a warehouse system keeping track of pallets of products. There are basically no conflicting events. Everything has already happened, even if it was wrong. (E.g. pallet 123456 was put on truck A, but was scheduled for truck B.) Then, later, the facts are checked for exceptions via reporting mechanisms. Kafka seems well-suited for this kind of downstream event-processing application.
In this context, it is understandable why Kafka folks are advocating it as an Event Sourcing solution. Because it is quite similar to how it is already used in, for example, click streams. However, people using the term Event Sourcing (as opposed to Stream Processing) are likely referring to the second usage...
Application-controlled source of truth = No
This kind of application declares its own events as a result of user requests passing through business logic. Kafka does not work well in this case for two primary reasons.
Lack of entity isolation
This scenario needs the ability to load the event stream for a specific entity. The common reason for this is to build a transient write model for the business logic to use to process the request. Doing this is impractical in Kafka. Using topic-per-entity could allow this, except this is a non-starter when there may be thousands or millions of entities. This is due to technical limits in Kafka/Zookeeper.
One of the main reasons to use a transient write model in this way is to make business logic changes cheap and easy to deploy.
Using topic-per-type is recommended instead for Kafka, but this would require loading events for every entity of that type just to get events for a single entity, since you cannot tell by log position which events belong to which entity. Even using snapshots to start from a known log position, this could be a significant number of events to churn through if structural changes to the snapshot are needed to support logic changes.
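The topic-per-type problem above can be sketched as follows: because the log is ordered by offset, not grouped by entity, reading one entity's history means scanning and filtering the whole type-level log. The event shape and function names are illustrative, not a real Kafka client API.

```python
# Sketch: with one topic per entity TYPE, loading a single entity's
# history requires reading every event of that type and filtering.

def load_entity_events(topic_log, entity_id):
    """Scan the whole type-level log to find one entity's events."""
    return [e for e in topic_log if e["entity_id"] == entity_id]

user_topic = [
    {"entity_id": "u1", "type": "UserRegistered"},
    {"entity_id": "u2", "type": "UserRegistered"},
    {"entity_id": "u1", "type": "EmailChanged"},
]

# Every event in the topic is read, even though only u1's two events are wanted.
print(load_entity_events(user_topic, "u1"))
```

In a dedicated event store this lookup is an indexed read of a single stream; here the cost grows with the total number of events of that type, not with the size of one entity's history.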
Lack of conflict detection
Secondly, users can create race conditions through concurrent requests against the same entity. It may be quite undesirable to save conflicting events and resolve them after the fact, so it is important to be able to prevent conflicting events. To scale request load, it is common to use stateless services while preventing write conflicts using conditional writes (only write if the last entity event was #x), a.k.a. optimistic concurrency. Kafka does not support optimistic concurrency. Even if it supported it at the topic level, it would need to go all the way down to the entity level to be effective. To use Kafka and prevent conflicting events, you would need a stateful, serialized writer (per "shard" or whatever Kafka's equivalent is) at the application level. This is a significant architectural requirement/restriction.
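The conditional write described above ("append only if the last entity event was #x") can be sketched with a toy in-memory store; this is purely illustrative and is the behavior Kafka lacks per entity.

```python
# Sketch of optimistic concurrency for an event store:
# append succeeds only if the caller's expected version matches
# the stream's current version. Illustrative, not a real library.

class ConcurrencyError(Exception):
    pass

class EventStore:
    def __init__(self):
        self.streams = {}  # entity id -> list of events

    def append(self, entity_id, expected_version, events):
        stream = self.streams.setdefault(entity_id, [])
        if len(stream) != expected_version:
            # Another request appended first: reject instead of
            # persisting a conflicting event.
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        stream.extend(events)

store = EventStore()
store.append("order-1", 0, ["OrderPlaced"])
store.append("order-1", 1, ["OrderPaid"])

# A stale writer that still believes the stream is at version 0 is rejected:
try:
    store.append("order-1", 0, ["OrderCancelled"])
except ConcurrencyError as e:
    print("conflict:", e)
```

A stateless service can retry on conflict: reload the stream, re-run the business logic against the fresh state, and attempt the append again with the new expected version.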
Bonus reason: fitment for problem
added 2021/09/29
Kafka is meant to solve giant-scale data problems and has commensurate overhead to do so. An app-controlled source of truth is a smaller scale, in-depth solution. Using event sourcing to good effect requires crafting events and streams to match the business processes. This usually has a much higher level of detail than would be generally useful to other parts of a system. Consider if your bank statement contained an entry for every step of a bank's internal processes. A single transaction could have many entries before it is confirmed to your account.
When I asked myself the same question as the OP, I wanted to know if Kafka was a scaling option for event sourcing. But perhaps a better question is whether it makes sense for my event sourced solution to operate at a giant scale. I can't speak to every case, but I think often it does not. When this scale enters the picture, the granularity of events tends to be different. And my event sourced system should probably publish higher granularity events to the Kafka cluster rather than use it as storage.
Scale can still be needed for event sourcing. Strategies differ depending on why. Often event streams have a "done" state and can be archived if storage or volume is the issue. Sharding is another option which works especially well for regional- or tenant-isolated scenarios. In less isolated scenarios, when streams are arbitrarily related in a way that can cross shard boundaries, sharding events is still quite easy (partition by stream ID). But things get more complicated for event consumers since events come from different shards and are no longer totally ordered. For example, you can receive transaction events before you receive events describing the accounts involved. Kafka has the same issue since events are only ordered within topics. Ideally you design the consumer so that ordering between streams would not be needed. Otherwise you resort to merging different sources and sorting by time stamp, then an arbitrary tie breaker (like shard ID) if the timestamps are the same. And it becomes important how out-of-sync a server's clock gets.
Summary
Can you force Kafka to work for an app-controlled source of truth? Sure, if you try hard enough and integrate deeply enough. But is it a good idea? No.
Update per comment
The comment has been deleted, but the question was something like: what do people use for event storage then?
It seems that most people roll their own event storage implementation on top of an existing database. For non-distributed scenarios, like internal back-ends or stand-alone products, it is well-documented how to create a SQL-based event store. And there are libraries available on top of various kinds of databases. There is also EventStore, which is built for this purpose.
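A minimal sketch of the "SQL-based event store" idea, using SQLite for self-containment: a single events table where the (stream_id, version) primary key gives optimistic concurrency for free. The schema and helper are illustrative, not from any specific library.

```python
import sqlite3

# Sketch of a roll-your-own event store on a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        stream_id TEXT    NOT NULL,
        version   INTEGER NOT NULL,
        type      TEXT    NOT NULL,
        payload   TEXT    NOT NULL,
        PRIMARY KEY (stream_id, version)
    )
""")

def append(stream_id, expected_version, event_type, payload):
    # The primary key rejects two writers racing for the same version:
    # a duplicate (stream_id, version) raises sqlite3.IntegrityError.
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (stream_id, expected_version + 1, event_type, payload),
    )

append("user-1", 0, "UserRegistered", '{"email": "a@example.com"}')
append("user-1", 1, "EmailChanged", '{"email": "b@example.com"}')
```

Reading one entity's history is then an indexed query (`SELECT ... WHERE stream_id = ? ORDER BY version`), which is exactly the per-entity isolation that the Kafka topic-per-type layout lacks.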
In distributed scenarios, I've seen a couple of different implementations. Jet's Panther project uses Azure CosmosDB, with the Change Feed feature to notify listeners. Another similar implementation I've heard about on AWS uses DynamoDB with its Streams feature to notify listeners. The partition key probably should be the stream ID for best data distribution (to lessen the amount of over-provisioning). However, a full replay across streams in Dynamo is expensive (read-wise and cost-wise). So this implementation was also set up for Dynamo Streams to dump events to S3. When a new listener comes online, or an existing listener wants a full replay, it reads S3 to catch up first.
My current project is a multi-tenant scenario, and I rolled my own on top of Postgres. Something like Citus seems appropriate for scalability, partitioning by tenant+stream.
Kafka is still very useful in distributed scenarios. It is a non-trivial problem to expose each service's events to other services. An event store is not built for that typically, but that's precisely what Kafka does well. Each service has its own internal source of truth (could be event storage or otherwise), but listens to Kafka to know what is happening "outside". The service may also post events to Kafka to inform the "outside" of interesting things the service did.
You can use Kafka as an event store, but I do not recommend doing so, although it might look like a good choice:
- Kafka only guarantees at-least-once delivery, so there may be duplicates in the event store that cannot be removed. Update: Here you can read why this is so hard with Kafka, and some news about how to finally achieve this behavior: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
- Due to immutability, there is no way to manipulate the event store when the application evolves and events need to be transformed (there are of course methods like upcasting, but...). One might say you never need to transform events, but that is not a correct assumption; there could be situations where you keep a backup of the originals but upgrade them to the latest versions. That is a valid requirement in event-driven architectures.
- There is no place to persist snapshots of entities/aggregates, so replay will become slower and slower. Creating snapshots is a must-have feature for an event store from a long-term perspective.
- Kafka partitions are distributed, and they are hard to manage and back up compared with databases. Databases are simply simpler :-)
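The snapshot point in the list above can be sketched briefly: replay starts from a saved (state, version) pair instead of from the beginning of the stream, so replay cost stays bounded as the stream grows. Everything here is illustrative.

```python
# Sketch of the snapshot optimisation: apply only the events that
# occurred after the snapshot, not the whole history.

def apply(total, event):
    return total + event  # trivial "state": a running total

def replay(events, snapshot_state=0, snapshot_version=0):
    """Rebuild state, optionally starting from a snapshot."""
    state = snapshot_state
    for event in events[snapshot_version:]:
        state = apply(state, event)
    return state

events = [1, 2, 3, 4, 5]
snapshot = (replay(events[:3]), 3)  # state 6 after the first 3 events

# A later replay touches only the 2 events after the snapshot,
# yet yields the same state as a full replay:
assert replay(events, *snapshot) == replay(events) == 15
```

An event store typically persists such snapshots alongside the stream; with plain Kafka topics there is no built-in place to keep them, which is the concern the bullet raises.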
So, think twice before you make your choice. An event store built as a combination of application-layer interfaces (monitoring and management), a SQL/NoSQL store, and Kafka as the broker is a better choice than leaving Kafka to handle both roles in order to create a complete, feature-full solution.
An event store is a complex service that requires more than what Kafka can offer, if you are serious about applying Event Sourcing, CQRS, Sagas, and other patterns in an event-driven architecture while staying high-performance.
Feel free to challenge my answer! You might not like what I say about your favorite broker with lots of overlapping capabilities, but still, Kafka wasn't designed as an event store; it was designed as a high-performance broker and, at the same time, a buffer to handle fast-producer versus slow-consumer scenarios, for example.
Please look at the eventuate.io microservices open source framework to discover more about the potential problems: http://eventuate.io/
Update as of 8th Feb 2018
I am not incorporating new info from the comments, but I agree with some of those points. This update is more about recommendations for an event-driven microservice platform. If you are serious about robust microservice design and the highest possible performance in general, here are a few hints you might be interested in.
- Don't use Spring - it is great (I use it myself a lot), but it is heavy and slow at the same time. And it is not a microservice platform at all; it's "just" a framework that helps you implement one (with a lot of work behind that...). Other frameworks are "just" lightweight REST or JPA or otherwise differently focused frameworks. I recommend what is probably the best-in-class open source complete microservice platform available, which goes back to pure Java roots: https://github.com/networknt
If you wonder about performance, you can compare for yourself against the existing benchmark suite: https://github.com/networknt/microservices-framework-benchmark
- Don't use Kafka at all :-)) That is half a joke. I mean, while Kafka is great, it is another broker-centric system. I think the future lies in broker-less messaging systems. You might be surprised, but there are systems faster than Kafka :-) - of course, you must get down to a lower level. Look at Chronicle.
- For the event store I recommend a superior PostgreSQL extension called TimescaleDB, which focuses on high-performance time-series data processing (events are time series) in large volumes. Of course CQRS and Event Sourcing features (replay, etc.) are built into the light4j framework out of the box, which uses Postgres for the underlying storage.
- For messaging, look at Chronicle Queue, Map, Engine, and Network. I mean, get rid of old-fashioned broker-centric solutions and go with a micro-messaging system (an embedded one). Chronicle Queue is actually even faster than Kafka. But I agree it is not an all-in-one solution, so you need to do some development - otherwise you go and buy the Enterprise version (the paid one). In the end, the effort to build your own messaging layer from Chronicle will be repaid by removing the burden of maintaining a Kafka cluster.