How efficient can Meteor be while sharing a huge collection among many clients?
The short answer is that only new data gets sent down the wire. Here's how it works.
There are three important parts of the Meteor server that manage subscriptions: the publish function, which defines the logic for what data the subscription provides; the Mongo driver, which watches the database for changes; and the merge box, which combines all of a client's active subscriptions and sends them out over the network to the client.
Publish functions
Each time a Meteor client subscribes to a collection, the server runs a
publish function. The publish function's job is to figure out the set
of documents that its client should have and send each document property
into the merge box. It runs once for each new subscribing client. You
can put any JavaScript you want in the publish function, such as
arbitrarily complex access control using this.userId. The publish
function sends data into the merge box by calling this.added,
this.changed and this.removed. See the full publish documentation for
more details.
Most publish functions don't have to muck around with the low-level
added, changed and removed API, though. If a publish function returns a
Mongo cursor, the Meteor server automatically connects the output of the
Mongo driver (added, changed and removed callbacks) to the input of the
merge box (this.added, this.changed and this.removed). It's pretty neat
that you can do all the permission checks up front in a publish function and
then directly connect the database driver to the merge box without any user
code in the way. And when autopublish is turned on, even this little bit is
hidden: the server automatically sets up a query for all documents in each
collection and pushes them into the merge box.
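To make that wiring concrete, here is a rough sketch of what returning a cursor amounts to if you did it by hand. The helper name publishCursorByHand and the 'somestuff' collection name are made up for illustration; the sketch only assumes that sub is the publish function's this and that cursor has an observeChanges method. It is not Meteor's actual internal code.

```javascript
// Sketch: forward a cursor's change callbacks into a subscription by hand.
// `sub` stands for the publish function's `this`; `cursor` is anything that
// exposes observeChanges(callbacks) and returns a handle with stop().
// The helper name is hypothetical.
function publishCursorByHand(sub, cursor, collectionName) {
  const handle = cursor.observeChanges({
    added: (id, fields) => sub.added(collectionName, id, fields),
    changed: (id, fields) => sub.changed(collectionName, id, fields),
    removed: (id) => sub.removed(collectionName, id),
  });
  // The initial result set has gone through `added` by this point,
  // so the subscription can be marked ready.
  sub.ready();
  // Stop watching the query when the client unsubscribes or disconnects.
  sub.onStop(() => handle.stop());
}
```

Inside a Meteor app you would call this from a publish function, along the lines of publishCursorByHand(this, Somestuff.find({}), 'somestuff'), after doing whatever permission checks you need.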
On the other hand, you aren't limited to publishing database queries.
For example, you can write a publish function that reads a GPS position
from a device inside a Meteor.setInterval, or polls a legacy REST API
from another web service. In those cases, you'd emit changes to the
merge box by calling the low-level added, changed and removed DDP API.
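As an illustration of the non-database case, here is a minimal sketch of a polling publish handler. Everything named here is an assumption for the example: makePositionPublisher, the readPosition callback, the 'positions' collection and the 'device' document id. The timers argument exists only so the sketch can run outside Meteor; on a Meteor server you would pass Meteor.setInterval and Meteor.clearInterval instead of the plain ones.

```javascript
// Sketch: a publish handler that polls some source and publishes the latest
// reading as a single document. All names are illustrative.
function makePositionPublisher(
  readPosition,
  intervalMs,
  timers = { set: setInterval, clear: clearInterval }
) {
  return function () {
    const sub = this; // Meteor invokes publish handlers with the subscription as `this`
    let last = readPosition();
    sub.added('positions', 'device', last);
    sub.ready();

    const handle = timers.set(() => {
      const next = readPosition();
      // Only push a change when the reading actually moved.
      if (next.lat !== last.lat || next.lng !== last.lng) {
        sub.changed('positions', 'device', next);
        last = next;
      }
    }, intervalMs);

    // Stop polling when the client unsubscribes or disconnects.
    sub.onStop(() => timers.clear(handle));
  };
}
```

The handler returned here could be handed to Meteor.publish under whatever name you like. Note that it returns nothing, since it feeds the merge box itself instead of returning a cursor.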
The Mongo driver
The Mongo driver's job is to watch the Mongo database for changes to
live queries. These queries run continuously and return updates as the
results change by calling added, removed, and changed callbacks.
Mongo is not a real time database. So the driver polls. It keeps an
in-memory copy of the last query result for each active live query. On
each polling cycle, it compares the new result with the previous saved
result, computing the minimum set of added, removed, and changed
events that describe the difference. If multiple callers register
callbacks for the same live query, the driver only watches one copy of
the query, calling each registered callback with the same result.
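The poll-and-diff step can be sketched roughly like this. It is a simplified stand-in, not the driver's real code: diffResults is a made-up name, documents are compared with JSON.stringify (which is sensitive to key order), and changed reports the whole new document instead of only the fields that differ.

```javascript
// Sketch: compare two query snapshots (arrays of documents with an _id) and
// report the difference through added / changed / removed callbacks.
function diffResults(oldDocs, newDocs, callbacks) {
  const oldById = new Map(oldDocs.map((doc) => [doc._id, doc]));
  const newById = new Map(newDocs.map((doc) => [doc._id, doc]));

  for (const [id, doc] of newById) {
    if (!oldById.has(id)) {
      callbacks.added(id, doc);
    } else if (JSON.stringify(oldById.get(id)) !== JSON.stringify(doc)) {
      // Simplification: a real diff would report only the changed fields.
      callbacks.changed(id, doc);
    }
  }
  for (const id of oldById.keys()) {
    if (!newById.has(id)) callbacks.removed(id);
  }
}
```

On each polling cycle the new result would be passed in as newDocs and then saved as the "previous" result for the next cycle.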
Each time the server updates a collection, the driver recalculates each live query on that collection. (Future versions of Meteor will expose a scaling API for limiting which live queries recalculate on update.) The driver also polls each live query on a 10 second timer to catch out-of-band database updates that bypassed the Meteor server.
The merge box
The job of the merge box is to combine the results (added, changed and
removed calls) of all of a client's active publish functions into a
single data stream. There is one merge box for each connected client. It
holds a complete copy of the client's minimongo cache.
In your example with just a single subscription, the merge box is essentially a pass-through. But a more complex app can have multiple subscriptions which might overlap. If two subscriptions both set the same attribute on the same document, the merge box decides which value takes priority and only sends that to the client. We haven't exposed the API for setting subscription priority yet. For now, priority is determined by the order the client subscribes to data sets. The first subscription a client makes has the highest priority, the second subscription is next highest, and so on.
Because the merge box holds the client's state, it can send the minimum amount of data to keep each client up to date, no matter what a publish function feeds it.
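Here is a toy model of that behaviour, to show why holding the client's state lets the server send only what changed. It is a sketch under heavy assumptions: MergeBoxSketch and its set method are invented names, it handles a single collection, it ignores removals, subscriptions are identified by a number where a lower number means an earlier (higher priority) subscription, and the message shapes are only loosely modeled on DDP.

```javascript
// Sketch: per-client merge box that tracks which subscription set which
// field and only emits a message when the client-visible value changes.
class MergeBoxSketch {
  constructor(send) {
    this.send = send;       // called with each message destined for the client
    this.docs = new Map();  // docId -> Map(field -> [{ sub, value }, ...])
  }

  // The value the client sees: the earliest subscription wins.
  static visible(entries) {
    if (!entries || entries.length === 0) return undefined;
    return entries.reduce((a, b) => (a.sub <= b.sub ? a : b)).value;
  }

  // Record that subscription `sub` sets `fields` on document `id`.
  set(sub, id, fields) {
    const isNewDoc = !this.docs.has(id);
    if (isNewDoc) this.docs.set(id, new Map());
    const doc = this.docs.get(id);
    const delta = {};
    for (const [field, value] of Object.entries(fields)) {
      const entries = doc.get(field) || [];
      const before = MergeBoxSketch.visible(entries);
      const own = entries.find((e) => e.sub === sub);
      if (own) own.value = value;
      else entries.push({ sub, value });
      doc.set(field, entries);
      const after = MergeBoxSketch.visible(entries);
      if (JSON.stringify(before) !== JSON.stringify(after)) delta[field] = after;
    }
    if (isNewDoc) {
      this.send({ msg: 'added', id, fields: delta });
    } else if (Object.keys(delta).length > 0) {
      this.send({ msg: 'changed', id, fields: delta });
    }
  }
}
```

With this model, a second subscription that sets an attribute already provided by the first one produces no traffic at all, and re-sending a value the client already has is dropped silently.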
What happens on an update
So now we've set the stage for your scenario.
We have 1,000 connected clients. Each is subscribed to the same live
Mongo query (Somestuff.find({})). Since the query is the same for each
client, the driver is only running one live query. There are 1,000
active merge boxes. And each client's publish function registered added,
changed, and removed callbacks on that live query that feed into one of
the merge boxes. Nothing else is connected to the merge boxes.
First the Mongo driver. When one of the clients inserts a new document
into Somestuff, it triggers a recomputation. The Mongo driver reruns
the query for all documents in Somestuff, compares the result to the
previous result in memory, finds that there is one new document, and
calls each of the 1,000 registered added callbacks.
Next, the publish functions. There's very little happening here: each
of the 1,000 added callbacks pushes data into the merge box by
calling this.added.
Finally, each merge box checks these new attributes against its
in-memory copy of its client's cache. In each case, it finds that the
values aren't yet on the client and don't shadow an existing value. So
the merge box emits a DDP DATA message on the SockJS connection to its
client and updates its server-side in-memory copy.
Total CPU cost is the cost to diff one Mongo query, plus the cost of 1,000 merge boxes checking their clients' state and constructing a new DDP message payload. The only data that flows over the wire is a single JSON object sent to each of the 1,000 clients, corresponding to the new document in the database, plus one RPC message to the server from the client that made the original insert.
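The cost accounting above can be checked with a small simulation. This is only a sketch of the fan-out: SharedLiveQuery is an invented class, the in-memory somestuff array stands in for the Mongo collection, only insertions are detected, and each client's merge box is reduced to an outbox array.

```javascript
// Sketch: one shared live query whose result is fanned out to every
// registered callback. The query reruns once per change, not once per client.
class SharedLiveQuery {
  constructor(runQuery) {
    this.runQuery = runQuery; // returns the current array of documents
    this.callbacks = [];
    this.lastIds = new Set(runQuery().map((doc) => doc._id));
    this.reruns = 0;          // number of recomputations after the initial run
  }

  register(onAdded) {
    this.callbacks.push(onAdded);
  }

  // Called after a write: rerun once, diff against the saved ids, fan out.
  recompute() {
    const docs = this.runQuery();
    this.reruns += 1;
    for (const doc of docs) {
      if (!this.lastIds.has(doc._id)) {
        this.callbacks.forEach((cb) => cb(doc));
      }
    }
    this.lastIds = new Set(docs.map((doc) => doc._id));
  }
}

const somestuff = [{ _id: 'a', text: 'first' }];
const liveQuery = new SharedLiveQuery(() => somestuff.slice());

const outboxes = []; // one outbox per simulated client
for (let i = 0; i < 1000; i++) {
  const outbox = [];
  outboxes.push(outbox);
  liveQuery.register((doc) =>
    outbox.push({ msg: 'added', id: doc._id, fields: { text: doc.text } })
  );
}

somestuff.push({ _id: 'b', text: 'second' }); // one client inserts a document
liveQuery.recompute();

const messagesSent = outboxes.reduce((n, outbox) => n + outbox.length, 0);
console.log(liveQuery.reruns, messagesSent); // prints: 1 1000
```

One query rerun, 1,000 small messages: the work that scales with the number of clients is the per-client bookkeeping, not the database query.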
Optimizations
Here's what we definitely have planned.
More efficient Mongo driver. We optimized the driver in 0.5.1 to only run a single observer per distinct query.
Not every DB change should trigger a recomputation of a query. We can make some automated improvements, but the best approach is an API that lets the developer specify which queries need to rerun. For example, it's obvious to a developer that inserting a message into one chatroom should not invalidate a live query for the messages in a second room.
The Mongo driver, publish function, and merge box don't need to run in the same process, or even on the same machine. Some applications run complex live queries and need more CPU to watch the database. Others have only a few distinct queries (imagine a blog engine), but possibly many connected clients -- these need more CPU for merge boxes. Separating these components will let us scale each piece independently.
Many databases support triggers that fire when a row is updated and provide the old and new rows. With that feature, a database driver could register a trigger instead of polling for changes.
From my experience, sharing a huge collection among many clients in Meteor is essentially unworkable as of version 0.7.0.1. I'll try to explain why.
As described in the above post and also in https://github.com/meteor/meteor/issues/1821, the Meteor server has to keep a copy of the published data for each client in the merge box. This is what allows the Meteor magic to happen, but it also means that any large shared dataset is held repeatedly in the memory of the Node process. Even when using a possible optimization for static collections such as in (Is there a way to tell meteor a collection is static (will never change)?), we experienced a huge problem with the CPU and memory usage of the Node process.
In our case, we were publishing a collection of 15k documents to each client that was completely static. The problem is that copying these documents to a client's merge box (in memory) upon connection basically brought the Node process to 100% CPU for almost a second, and resulted in a large additional usage of memory. This is inherently unscalable, because any connecting client will bring the server to its knees (and simultaneous connections will block each other) and memory usage will go up linearly in the number of clients. In our case, each client caused an additional ~60MB of memory usage, even though the raw data transferred was only about 5MB.
In our case, because the collection was static, we solved this problem by sending all the documents as a .json file, which was gzipped by nginx, and loading them into an anonymous collection, resulting in only a ~1MB transfer of data with no additional CPU or memory in the node process and a much faster load time. All operations over this collection were done by using _ids from much smaller publications on the server, allowing for retaining most of the benefits of Meteor. This allowed the app to scale to many more clients. In addition, because our app is mostly read-only, we further improved the scalability by running multiple Meteor instances behind nginx with load balancing (though with a single Mongo), as each Node instance is single-threaded.
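Roughly, the client side of that workaround looks like the sketch below. It is not our actual code: LocalCollectionStandIn is a tiny in-memory substitute so the example runs outside Meteor, and loadStaticDocs and resolveIds are names made up here. In a real Meteor client, the stand-in's role is played by an anonymous collection (one created with a null name), and the docs array is the parsed contents of the static .json file.

```javascript
// Minimal in-memory stand-in for a client-only collection, so the sketch
// runs outside Meteor. Hypothetical; a Meteor app would use an anonymous
// collection here.
class LocalCollectionStandIn {
  constructor() {
    this.byId = new Map();
  }
  insert(doc) {
    this.byId.set(doc._id, doc);
    return doc._id;
  }
  findOne(id) {
    return this.byId.get(id);
  }
}

// Load the parsed contents of the static .json file into the local collection.
function loadStaticDocs(collection, docs) {
  docs.forEach((doc) => collection.insert(doc));
  return collection;
}

// Join _ids coming from a small published collection against the static data.
function resolveIds(collection, ids) {
  return ids
    .map((id) => collection.findOne(id))
    .filter((doc) => doc !== undefined);
}
```

The server then only publishes small documents that carry _ids, and the client looks the heavy documents up locally, so the big dataset never passes through a merge box.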
However, the issue of sharing large, writeable collections among multiple clients is an engineering problem that needs to be solved by Meteor. There is probably a better way than keeping a copy of everything for each client, but that requires some serious thought as a distributed systems problem. The current issues of massive CPU and memory usage just won't scale.
The experiment that you can use to answer this question:
- Create a test Meteor app: meteor create --example todos
- Run it under Webkit inspector (WKI).
- Examine the contents of the XHR messages moving across the wire.
- Observe that the entire collection is not moved across the wire.
For tips on how to use WKI check out this article. It's a little out of date, but mostly still valid, especially for this question.
This is about a year old now and therefore, I think, pre-"Meteor 1.0" knowledge, so things may have changed again. I'm still looking into this. http://meteorhacks.com/does-meteor-scale.html leads to a "How to scale Meteor?" article: http://meteorhacks.com/how-to-scale-meteor.html