What are the typical methods used to scale up/out email storage servers?
What I've tried:
- I have two email storage architectures, an old one and a new one.
Old:
- courier-imapd on several (18+) servers with 1TB of storage each.
- If one of them shows signs of running out of disk space, we migrate a few email accounts to another server.
- The servers have no replicas, and no backups either.
New:
- dovecot2 on a single huge server with 16TB of (SATA) storage and a few SSDs
- we store fresh mail on the SSDs and run a doveadm purge to move messages older than a day onto the SATA disks (see the cron sketch after this list)
- there is an identical standby server holding an rsync backup of the primary that is at most 15 minutes old
- management wanted to pack as much storage as possible into each server, to minimise the SSD cost per server
- we rsync because GlusterFS wasn't replicating well under that kind of heavy small/random I/O
- scaling out was expected to be done by provisioning another pair of such huge servers
- on hitting a disk crunch like in the old architecture, email accounts would again be moved around manually
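A minimal sketch of the two cron jobs this setup implies, assuming mdbox/sdbox with an :ALT= path on the SATA array and /var/vmail as the mail root (both assumptions; note that in dovecot2 the command that actually relocates old mail to alternate storage is doveadm altmove, whereas doveadm purge only reclaims space from deleted messages):

```
# hypothetical crontab; paths, schedule and the standby hostname are guesses
# assumes mail_location = mdbox:/var/vmail/%d/%n:ALT=/srv/sata/%d/%n
# demote mail saved more than a day ago to the SATA ALT storage
30 3 * * *    doveadm altmove -A savedbefore 1d
# refresh the warm standby every 15 minutes
*/15 * * * *  rsync -aH --delete /var/vmail/ standby:/var/vmail/
```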
Concerns/doubts:
- I'm not convinced that the synchronously-replicated-filesystem idea works well for heavy random/small I/O. GlusterFS isn't working for us yet, and I'm not sure whether there's another filesystem out there for this use case. The idea was to keep identical pairs and use DNS round-robin for email delivery and IMAP/POP3 access; if one of the servers went down for whatever reason (planned or unplanned), we'd move its IP to the other server in the pair (see the failover sketch after this list).
- In filesystems like Lustre, I get the advantage of a single namespace whereby I do not have to worry about manually migrating accounts around and updating MAILHOME paths and other metadata/data.
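For the pair-failover part of that concern, the manual IP takeover could look like the sketch below, assuming the pair shares a subnet, the service IP is 192.0.2.11 and it lives on eth0 (all placeholder values):

```
# on the surviving server, claim the failed server's service IP
ip addr add 192.0.2.11/24 dev eth0
# send gratuitous ARP so switches and neighbours relearn the MAC quickly
arping -c 3 -U -I eth0 192.0.2.11
```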
Questions:
- What are the typical methods used to scale up/out with the traditional software (courier-imapd / dovecot)?
- Does traditional software that stores mail on a locally mounted filesystem pose a roadblock to scaling out with minimal "problems"? Would one have to rewrite (parts of) it to work with object storage of some sort, such as OpenStack Object Storage?
Solution 1:
What I've seen at medium-to-large companies is redundant storage appliances such as NetApp or EMC. In fact, I was talking to an EMC rep about email storage a little while ago, and he said huge email servers are a very common sale for them.
Basically they take all of the storage issues away from the application. Performance for lots of short random reads is achieved with SSDs or battery-backed memory cache. All the storage is in one place with multiple paths to redundant server modules, so there's no replication latency.
The application servers access the storage over NFS, or over iSCSI, which is less flexible but sometimes required when the application doesn't behave well on NFS. Either way the storage can be shared by any number of servers over high-speed Ethernet, so you can scale up to the maximum I/O performance of the storage box, which you can then expand as needed.
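As a hedged illustration of what such a shared NFS mount tends to look like for maildir-style workloads (the filer name, export path and options are assumptions; Dovecot's NFS documentation additionally suggests settings like mmap_disable = yes when several servers touch the same mailboxes):

```
# hypothetical /etc/fstab entry: attribute caching is disabled (noac) so
# multiple application servers see each other's maildir changes promptly,
# at a real cost in extra metadata round-trips
filer01:/vol/mail  /var/vmail  nfs  rw,hard,noac,vers=3  0 0
```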
As far as redundancy on the application servers goes, the cheapest option is a software clustering package. There are also appliances like BIG-IP that handle it at the network level and are OS-independent. It depends heavily on whether the application can work reliably over NFS in parallel with other instances.
Solution 2:
I'm a bit wary of the big-iron approach: it's often difficult to scale, and people tend to build a failover solution and hope it will work when an outage comes along. Most people stopped applying that approach to components such as disks and network cards a long time ago, but applying it to servers is a bit more tricky. You can shard the data by splitting the users via LDAP (a sketch follows). This doesn't directly solve the replication issue, but by running with, say, 8 pairs of load-balanced servers, the contention would likely be significantly lower. Certainly GlusterFS, IMHO, does not work well on highly transactional systems, nor do I think a nearline-type system (e.g. AFS) is a good idea either. The problem is that there are lots of little changes which need to be carried out on all the mirrors synchronously, and really this is best done at the application level to maintain consistency.
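One way to make that user-to-pair mapping transparent, sketched on the assumption that frontends run Dovecot's proxying and LDAP stores a per-user mailHost attribute (the attribute name and the proxying approach are my assumptions, not something from the question):

```
# hypothetical fragment of dovecot-ldap.conf.ext on a frontend:
# LDAP names the backend pair holding the mailbox, and dovecot
# proxies the IMAP/POP3 session there after authenticating the user
pass_attrs = uid=user, userPassword=password, mailHost=host, =proxy=y
```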
Although Dovecot is intended to work with multiple servers on shared storage, an NFS/iSCSI-type approach still implies a SPOF or a failover-type approach rather than load balancing. I hear lots of good things about GFS2 for transactional workloads, and by scaling down to smaller systems, running it on top of DRBD would give the replication required (a sketch of the DRBD side follows). But do try to isolate the pairs on dedicated switches (or use crossover Ethernet connections) to keep the replication noise off the main network.
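A sketch of the replication layer for one such pair, assuming DRBD 8.4 syntax, /dev/sdb1 as the backing disk and a dedicated 10.0.0.0/24 replication link (all assumptions); GFS2 also needs a proper cluster stack with fencing on top, which is omitted here:

```
# hypothetical /etc/drbd.d/mail.res for a dual-primary pair under GFS2
resource mail {
  net {
    protocol C;                  # synchronous: writes complete only when both nodes have them
    allow-two-primaries yes;     # GFS2 mounts the volume on both nodes at once
  }
  on mail-a {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;     # dedicated replication link, per the advice above
    meta-disk internal;
  }
  on mail-b {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```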
In short, I'm sorry to say that although Dovecot is probably a lot better than Courier for this type of operation, I think your new architecture is a step backwards.
(I'm guessing that you are using maildir rather than mbox)
I'd agree that having a fixed set of mappings of users to clusters is a bit of an overhead, and not the most efficient use of the available resources. Perhaps the best compromise would be LVS on top of a GFS2 SAN, letting the SAN handle the replication (sketched below). For a cheaper solution, although this is a lot of guesswork and would require some investigation/testing, perhaps run a FUSE filesystem with a database backend and use the database's replication functionality (e.g. mysqlfs).
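The LVS layer itself is small; a minimal ipvsadm sketch, assuming a virtual IP of 192.0.2.10 and two backends on 10.0.0.x that both mount the same GFS2 volume (all addresses are placeholders):

```
# hypothetical LVS setup: round-robin IMAP connections across the pair
ipvsadm -A -t 192.0.2.10:143 -s rr                 # define the virtual service
ipvsadm -a -t 192.0.2.10:143 -r 10.0.0.1:143 -m    # real server 1 (NAT mode)
ipvsadm -a -t 192.0.2.10:143 -r 10.0.0.2:143 -m    # real server 2 (NAT mode)
```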