node.js, mongodb, redis on ubuntu: performance degradation in production, RAM is free, CPU at 100%

After a few days of intense trial and error, I'm glad to say I've understood where the bottleneck was, and I'm posting it here so that other people can benefit from my findings.

The problem lay in the pub/sub connections I was using with socket.io, and in particular in the RedisStore that socket.io uses to handle inter-process communication between socket instances.

After realizing that I could easily implement my own version of pub/sub with redis, I decided to give it a try and removed the RedisStore from socket.io, leaving it with the default memory store (I don't need to broadcast to all connected clients, only between two users who may be connected to different processes).
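
For reference, a minimal sketch of that change, assuming the socket.io 0.9-style API this setup appears to use (the exact RedisStore options may differ by version):

```js
// Before: RedisStore wired in for inter-process communication
// (socket.io 0.9-style configuration).
var RedisStore = require('socket.io/lib/stores/redis');
var redis = require('redis');

io.set('store', new RedisStore({
  redisPub: redis.createClient(),
  redisSub: redis.createClient(),
  redisClient: redis.createClient()
}));

// After: don't set a store at all, so socket.io falls back to its
// default in-memory store.
```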

Initially I declared only 2 global redis connections per process to handle pub/sub for every connected client. The application was using fewer resources, but I was still seeing constant CPU growth, so not much had changed. Then I decided to try creating 2 new redis connections for each client, handling pub/sub only for that client's session, and closing the connections once the user disconnected. After one day of usage in production, the CPUs were still at 0-5%... bingo! No process restarts, no bugs, and the performance I was expecting to have. Now I can say that node.js rocks, and I'm happy to have chosen it for building this app.
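
A minimal sketch of the per-client approach, assuming the node_redis client; the channel names and the `data.to` routing field are hypothetical placeholders for your actual session logic:

```js
var redis = require('redis');

io.sockets.on('connection', function (socket) {
  // Hypothetical per-user channel name; adapt to your own session logic.
  var channel = 'user:' + socket.id;

  // Two dedicated connections per client: once a redis connection calls
  // SUBSCRIBE it can't issue regular commands, so publishing needs its own.
  var sub = redis.createClient();
  var pub = redis.createClient();

  sub.subscribe(channel);
  sub.on('message', function (ch, message) {
    socket.emit('message', message);
  });

  socket.on('message', function (data) {
    // 'data.to' stands in for however you resolve the other user.
    pub.publish('user:' + data.to, data.text);
  });

  // The crucial part: tear both connections down on disconnect, so
  // subscriptions don't accumulate for the lifetime of the process.
  socket.on('disconnect', function () {
    sub.unsubscribe();
    sub.quit();
    pub.quit();
  });
});
```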

Fortunately redis was designed to handle many concurrent connections (unlike mongo), and its default limit is 10k clients. With 2 connections per user, that leaves room for around 5k concurrent users on a single redis instance, which is enough for me for the moment; I've read that it can be pushed up to 64k concurrent connections, so I believe this architecture should be solid enough.
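
For reference, that limit is the maxclients directive in redis.conf (10000 is the default since redis 2.6; verify against your version):

```
# redis.conf -- maxclients defaults to 10000 in redis 2.6+
# (redis will also lower it if the process fd limit is smaller)
maxclients 10000
```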

At this point I was thinking of implementing some sort of connection pool for redis, to optimize things a little further, but I'm not sure whether that would just cause pub/sub subscriptions to build up on the pooled connections again, unless each connection is destroyed and recreated every time to clean it. One option, sketched below, would be to pool only the publisher connections.
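
A rough sketch of that idea, assuming the generic-pool module (pool size and channel names are placeholders): publishers carry no subscription state, so they're safe to reuse, while subscribers keep their SUBSCRIBE state and are better created and destroyed per client, as above.

```js
var redis = require('redis');
var poolModule = require('generic-pool');

// Pool of stateless publisher connections only.
var pubPool = poolModule.Pool({
  name: 'redis-pub',
  create: function (callback) {
    callback(null, redis.createClient());
  },
  destroy: function (client) {
    client.quit();
  },
  max: 50,
  idleTimeoutMillis: 30000
});

// Usage: borrow a publisher, publish, hand it back.
pubPool.acquire(function (err, client) {
  if (err) throw err;
  client.publish('user:42', 'hello');
  pubPool.release(client);
});
```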

Anyway, thanks for your answers; I'd be curious to know what you think, and whether you have any other suggestions.

Cheers.


Do you have some source code you can share? Could it be database connections that are never closed, or processes waiting on HTTP connections that never close?

Can you post some logs?

Do a ps -ef and make sure nothing is still running. I have seen web processes leave zombies that won't die until you do a kill -9. Sometimes shutdown doesn't work, or doesn't work fully, and those threads or processes will hold RAM and sometimes CPU.

It could be an infinite loop somewhere in the code, or a crashed process holding onto a db connection.

What NPM modules are you using? Are they all the latest versions?

Are you catching exceptions? See http://geoff.greer.fm/2012/06/10/nodejs-dealing-with-errors/ and https://stackoverflow.com/questions/10122245/capture-node-js-crash-reason
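
A minimal last-resort handler, as a sketch: log the error and exit, so a supervisor such as forever (linked below) can restart the process in a clean state.

```js
process.on('uncaughtException', function (err) {
  console.error('Uncaught exception:', err.stack || err);
  process.exit(1);
});
```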

General Tips:

http://clock.co.uk/tech-blogs/preventing-http-raise-hangup-error-on-destroyed-socket-write-from-crashing-your-nodejs-server

http://blog.nodejitsu.com/keep-a-nodejs-server-up-with-forever

http://hectorcorrea.com/blog/running-a-node-js-web-site-in-production-a-beginners-guide

https://stackoverflow.com/questions/1911015/how-to-debug-node-js-applications

https://github.com/dannycoates/node-inspector

http://elegantcode.com/2011/01/14/taking-baby-steps-with-node-js-debugging-with-node-inspector/


Not an answer per se, as your question reads more like a tale than a question with a single answer.

I just want to mention that I successfully built a node.js server with socket.io that handled over 1 million persistent connections, with an average message payload of 700 bytes.

The 1 Gbps network interface card was saturating at the beginning, and I was seeing a LOT of I/O wait from publishing events to all clients.

Removing nginx from the proxy role also freed up precious memory; reaching one million persistent connections with only ONE server is a tough job of tweaking configs, the application, and OS parameters. Keep in mind that it's only doable with a lot of RAM (around 1M websocket connections eats about 16GB of RAM with node.js; I think sock.js would be ideal for low memory consumption, but for now socket.io consumes that much).

This link was my starting point for reaching that volume of connections with node. Although it's about an Erlang app, the OS tuning is pretty much application agnostic and should be useful to anyone aiming at a lot of persistent connections (websockets or long-polling). The kind of settings involved are sketched below.
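
For reference, these are the typical OS parameters involved (illustrative values, not a recommendation; tune for your own hardware):

```
# /etc/sysctl.conf -- illustrative values for very high connection counts
fs.file-max = 1048576                      # system-wide fd limit
net.ipv4.ip_local_port_range = 1024 65535  # more ephemeral ports
net.core.somaxconn = 65535                 # larger accept backlog

# /etc/security/limits.conf -- per-process fd limit for the node user
*  soft  nofile  1048576
*  hard  nofile  1048576
```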

HTH,