What are the scalability concerns with pub/sub servers?

The memory used by a single socket itself is small.

What does eat up memory is the state that tracks which clients are interested in which updates, and which clients have already received a given update.
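To make that state concrete, here is a minimal Python sketch of the bookkeeping a broker might keep if it tracked it explicitly. The class and method names are hypothetical, and it assumes (this is not stated in the answer) that updates on each topic carry increasing sequence numbers. Both maps grow with the number of clients times the number of topics each one subscribes to.

```python
from collections import defaultdict


class SubscriptionRegistry:
    """Hypothetical broker bookkeeping: who wants which topic, and who has seen what."""

    def __init__(self) -> None:
        # topic -> set of subscribed client ids ("which client is interested in which updates")
        self.subscribers = defaultdict(set)
        # (client id, topic) -> highest sequence number the client has acknowledged
        # ("which client has already received a particular update"); assumes sequence numbers
        self.acked = {}

    def subscribe(self, client: str, topic: str) -> None:
        self.subscribers[topic].add(client)
        self.acked.setdefault((client, topic), 0)

    def ack(self, client: str, topic: str, seq: int) -> None:
        key = (client, topic)
        self.acked[key] = max(self.acked.get(key, 0), seq)

    def pending_for(self, topic: str, seq: int) -> set:
        """Subscribers of `topic` that have not yet acknowledged update `seq`."""
        # .get() avoids creating an empty entry for topics nobody subscribed to
        return {c for c in self.subscribers.get(topic, set()) if self.acked[(c, topic)] < seq}


reg = SubscriptionRegistry()
reg.subscribe("alice", "prices")
reg.subscribe("bob", "prices")
reg.ack("alice", "prices", 1)
print(sorted(reg.pending_for("prices", 1)))  # ['bob']
```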

In a straightforward implementation (i.e. one that relies on the OS network stack), the record of which clients have already received an update is kept in the form of outgoing buffers. If an update is sent to 10,000 clients, the payload is copied 10,000 times and each copy is appended to that connection's outgoing queue. There it is combined with the requisite headers (which contain per-connection state), and a descriptor is built that instructs the network hardware to send a packet consisting of the headers followed by the payload.

Each per-client copy of the payload stays in memory until the client acknowledges it, and that is where the memory requirements come from. This memory cannot be paged out, so it puts memory and cache pressure on other applications.
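A back-of-the-envelope sketch of that cost in Python, counting only payload bytes (headers and kernel buffer overhead are ignored). The 10,000 clients come from the example above; the 1 KiB payload and the 10 unacknowledged updates are assumed figures for illustration.

```python
def unacked_payload_bytes(clients: int, payload_bytes: int, updates_in_flight: int) -> int:
    """Payload bytes pinned in send buffers when every client gets its own copy
    and none of the in-flight updates have been acknowledged yet."""
    return clients * payload_bytes * updates_in_flight


# Assumed figures: 1 KiB updates, 10 of them not yet acknowledged by any client.
total = unacked_payload_bytes(clients=10_000, payload_bytes=1_024, updates_in_flight=10)
print(f"{total:,} bytes = {total / 2**20:.1f} MiB")  # 102,400,000 bytes = 97.7 MiB
```

Holding a single shared copy of each of those updates instead would pin only 10 × 1 KiB = 10 KiB of payload.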

Some implementations move parts of the network stack into the server program itself. These can avoid the copies by reference counting payloads or regenerating them on demand, which lets you get away with much less memory. Making that truly scalable involves a lot of tricky code, however; multi-socket (multi-CPU) servers in particular pose some interesting issues that the OS network stack already knows how to work around.
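A minimal Python sketch of the reference-counting approach: each update is stored once, every client queue holds a reference to it, and the payload becomes reclaimable when the last client acknowledges it. All names are hypothetical. Python already reference-counts objects internally, so the explicit `refs` counter only models what a server written in a lower-level language would maintain by hand; per-connection headers, which such a server would generate at send time, are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity hashing, so payloads can live in a set
class SharedPayload:
    """A single copy of an update's bytes, shared by every client queue."""
    data: bytes
    refs: int = 0


class Broker:
    def __init__(self) -> None:
        self.queues = {}   # client id -> deque of unacknowledged SharedPayload references
        self.live = set()  # payloads still referenced by at least one queue

    def connect(self, client: str) -> None:
        self.queues[client] = deque()

    def publish(self, data: bytes) -> SharedPayload:
        payload = SharedPayload(data)
        for queue in self.queues.values():
            queue.append(payload)  # enqueue a reference; the bytes are not copied
            payload.refs += 1
        if payload.refs:
            self.live.add(payload)
        return payload

    def ack_oldest(self, client: str) -> None:
        """The client acknowledged its oldest outstanding update."""
        payload = self.queues[client].popleft()
        payload.refs -= 1
        if payload.refs == 0:
            self.live.discard(payload)  # last reference dropped: memory can be reclaimed


broker = Broker()
for name in ("a", "b", "c"):
    broker.connect(name)
update = broker.publish(b"tick")
broker.ack_oldest("a")
broker.ack_oldest("b")
print(update.refs, len(broker.live))  # 1 1
broker.ack_oldest("c")
print(update.refs, len(broker.live))  # 0 0
```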

The options you have, in order, form your escalation strategy as the service grows:

  1. run the pub/sub service on the same server as the app
  2. run the pub/sub service on a dedicated server with OS networking
  3. run the pub/sub service on a dedicated server with custom networking
  4. run the pub/sub service on multiple dedicated servers

Moving from shared to dedicated does not require much planning and can be done as needed; once that has happened, it is time to prepare the later stages.

Scaling out to multiple servers introduces nondeterminism into your system, because clients may receive updates in different orders. For this step to succeed, your clients need to be aware of that and still present a consistent view; whether that is trivial or difficult depends on your application.
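One common way for clients to present a consistent view, sketched in Python under an assumption the answer does not make: a single sequencer stamps every update on a topic with a gap-free sequence number starting at 1, and each client buffers out-of-order updates until the gap is filled, dropping duplicates. The class name is hypothetical. This only restores per-topic order; ordering across topics or across several sequencers needs more design, which is where the application-specific difficulty comes in.

```python
class OrderedDelivery:
    """Hypothetical client-side reorder buffer for one topic.

    Assumes updates carry gap-free sequence numbers starting at 1.
    """

    def __init__(self) -> None:
        self.next_seq = 1  # the next sequence number the client may apply
        self.held = {}     # out-of-order updates waiting for a gap to fill

    def receive(self, seq: int, update: str) -> list[str]:
        """Accept an update in any order; return the updates that can now be applied in order."""
        if seq < self.next_seq or seq in self.held:
            return []  # duplicate (e.g. delivered by two servers): ignore it
        self.held[seq] = update
        ready = []
        while self.next_seq in self.held:
            ready.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return ready


client = OrderedDelivery()
print(client.receive(2, "b"))  # []
print(client.receive(1, "a"))  # ['a', 'b']
print(client.receive(2, "b"))  # [] (duplicate)
print(client.receive(3, "c"))  # ['c']
```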

tl;dr: no need to optimize prematurely. Split out the service so the first scaling step is a simple configuration change, and start optimizing as soon as that has happened.
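One way to make that first step a configuration change, sketched in Python: the app reads the pub/sub endpoint from the environment instead of hard-coding it. The variable name `PUBSUB_URL`, the default URL and the host names are all hypothetical.

```python
import os

# Hypothetical default: the pub/sub service runs on the same machine as the app.
DEFAULT_PUBSUB_URL = "ws://localhost:8080/updates"


def pubsub_endpoint() -> str:
    """Where the app should publish; set PUBSUB_URL to point at a dedicated server."""
    return os.environ.get("PUBSUB_URL", DEFAULT_PUBSUB_URL)
```

Moving to a dedicated server is then a matter of starting the app with, for example, `PUBSUB_URL=ws://pubsub.internal:8080/updates`, with no code change.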