Is Apache Kafka appropriate for use as an unordered task queue?
Kafka splits incoming messages up into partitions, according to the partition assigned by the producer. Messages from partitions then get consumed by consumers in different consumer groups.
This architecture makes me wary of using Kafka as a work/task queue, because I have to specify the partition at time of production, which indirectly limits which consumers can work on it because a partition is sent to only one consumer in a consumer group. I would rather not specify the partition ahead of time, so that whichever consumer is available to take that task can do so. Is there a way to structure partitions/producers in a Kafka architecture where tasks can be pulled by the next available consumer, without having to split up work ahead of time by choosing a partition when the work is produced?
Using only one partition for this topic would put all the tasks in the same queue, but then the number of consumers is limited to 1 per consumer group, so each consumer would have to be in a different group. Then all of the tasks get distributed to every consumer group, though, which is not the kind of work queue I'm looking for.
Is Apache Kafka appropriate for use as a task queue?
Using Kafka for a task queue is a bad idea. Use RabbitMQ instead; it handles this use case much better and more elegantly.
Although you can use Kafka as a task queue, you will run into some issues. By design, Kafka does not allow a single partition to be consumed by more than one consumer in the same consumer group, so if, for example, a single partition fills up with many tasks while the consumer that owns it is busy, the tasks in that partition will starve. It also means that tasks in the topic will not necessarily be consumed in the order in which they were produced, which can cause serious problems if the tasks need to be consumed in a specific order. (To fully achieve that in Kafka you must have only one consumer and one partition, which means serial consumption by a single node. With multiple consumers and multiple partitions, the order of task consumption is not guaranteed at the topic level.)
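To make the starvation point concrete, here is a rough sketch in plain Python (no Kafka client involved). The task durations and the function names are made up for illustration; it only models "one owner per partition" versus "next free worker takes the next task".

```python
import heapq
from itertools import accumulate

def finish_times(tasks):
    # One consumer drains its partition serially, so each task
    # finishes only after everything ahead of it in that partition.
    return list(accumulate(tasks))

def makespan_partitioned(partitions):
    # Each partition has exactly one owner; the slowest partition decides.
    return max(sum(tasks) for tasks in partitions)

def makespan_shared(tasks, workers):
    # Shared work queue: the next free worker takes the next task.
    free_at = [0] * workers
    heapq.heapify(free_at)
    end = 0
    for duration in tasks:
        start = heapq.heappop(free_at)
        finish = start + duration
        end = max(end, finish)
        heapq.heappush(free_at, finish)
    return end

# Hypothetical workload: one slow task (10 time units) lands in partition 0.
partitions = [[10, 1, 1, 1], [1, 1, 1, 1]]
print(finish_times(partitions[0]))        # [10, 11, 12, 13]
print(makespan_partitioned(partitions))   # 13
print(makespan_shared([10, 1, 1, 1, 1, 1, 1, 1], workers=2))  # 10
```

In the partitioned case the three short tasks behind the slow one wait until time 11 to 13, even though the second consumer went idle at time 4. With a shared queue the second worker would have picked them up.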
In fact, Kafka topics are not queues in the computer-science sense. A queue means first in, first out, and that is not what you get in Kafka at the topic level.
Another issue is that it is difficult to change the number of partitions dynamically, whereas adding or removing workers should be dynamic. If you want to ensure that new workers will get tasks in Kafka, you have to set the number of partitions to the maximum possible number of workers. This is not very elegant.
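A small sketch of why the partition count caps the number of useful workers. This is plain Python and a simplified round-robin stand-in, not Kafka's actual assignor; the worker names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    # Simplified model of one consumer group: every partition gets
    # exactly one owner, handed out round-robin over the consumers.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment

workers = ["w0", "w1", "w2", "w3", "w4"]
assignment = assign_partitions(3, workers)
idle = [w for w, parts in assignment.items() if not parts]
print(assignment)  # {'w0': [0], 'w1': [1], 'w2': [2], 'w3': [], 'w4': []}
print(idle)        # ['w3', 'w4']
```

With 3 partitions and 5 workers in the same group, two workers never receive a task, which is why you end up sizing the topic for the maximum number of workers up front.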
So the bottom line - use RabbitMQ or other queues instead.
Having said all of that, Samza (by LinkedIn) uses Kafka as a sort of streaming-based task queue.
Edit (scale considerations): I forgot to mention that Kafka is a big-data, large-scale tool. If your job rate is huge, Kafka might be a good option for you despite what I wrote earlier, since dealing with huge scale is very challenging and Kafka is very good at it. At smaller scales (say, up to a few dozen or a few hundred jobs per second), Kafka is again a poor choice compared to RabbitMQ.
I would say that this depends on the scale. How many tasks do you anticipate in a unit of time?
What you describe as your end goal is basically how Kafka works by default.
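When you produce messages without specifying a partition or a key, the default (and most widely used) partitioner spreads them across partitions for you, in round-robin fashion in older clients (newer clients fill a batch for one partition at a time before moving on), keeping partitions evenly used. So it is possible to avoid specifying a partition.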
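As an illustration of the round-robin idea only, here is a toy partitioner in plain Python. It is not the Kafka client's implementation, and the function name is made up.

```python
from collections import Counter
from itertools import cycle

def make_round_robin_partitioner(num_partitions):
    # Returns a function that hands out partition numbers 0, 1, ..., n-1, 0, 1, ...
    counter = cycle(range(num_partitions))
    return lambda: next(counter)

next_partition = make_round_robin_partitioner(3)
chosen = [next_partition() for _ in range(9)]
print(chosen)           # [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(Counter(chosen))  # each of the 3 partitions gets 3 messages
```

The producer never names a partition here, and the load still ends up even across partitions.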
The main purpose of partitions is to parallelize the processing of messages, so you should use them that way.
Partitions are also commonly used to guarantee that certain messages are consumed in the same order in which they were produced. To do that, you specify a partitioning key so that all such messages end up in the same partition; e.g. using userId as the key ensures that all messages for a given user are processed in order.
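A minimal sketch of key-based partitioning in plain Python. Note the assumption: Kafka's Java client hashes keys with murmur2, while this sketch uses `zlib.crc32` purely as a deterministic stand-in, so the partition numbers will not match a real cluster. The event data is made up.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (NOT Kafka's murmur2): same key -> same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical events as (userId, action), in production order.
events = [("alice", "create"), ("bob", "create"),
          ("alice", "update"), ("alice", "delete"), ("bob", "update")]

partitions = defaultdict(list)
for user_id, action in events:
    partitions[partition_for(user_id, 4)].append((user_id, action))

# All of alice's events sit in one partition, still in production order.
alice = [a for u, a in partitions[partition_for("alice", 4)] if u == "alice"]
print(alice)  # ['create', 'update', 'delete']
```

Because a partition has a single consumer per group, that consumer sees a given user's messages in the order they were produced.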