Simple and reliable centralized logging inside Amazon VPC

I need to set up centralized logging for a set of servers (10-20) in an Amazon VPC. The logging should not lose any log messages if a single server goes offline, or if an entire availability zone goes offline. It should also tolerate packet loss and other normal network conditions without losing or duplicating messages. It should store the messages durably, at a minimum on two different EBS volumes in two availability zones, though S3 is a good place as well. It should also be realtime, so that messages arrive in two different availability zones within seconds of being generated. I also need to sync logfiles that are not generated via syslog, so a syslog-only centralized logging solution would not fulfill all the needs, although I guess that limitation could be worked around.

I have already reviewed a few solutions, and I will list them here:

Flume to Flume to S3: I could set up two logservers as Flume hosts which would store log messages either locally or in S3, and configure all the servers with Flume to send all messages to both servers, using the end-to-end reliability options. That way the loss of a single server shouldn't cause lost messages, and all messages would arrive in two availability zones in realtime. However, there would need to be some way to join the logs of the two servers, deduplicating all the messages delivered to both. This could be done by adding a unique id to each message on the sending side and then running hand-written deduplication passes over the logfiles. I haven't found an easy solution to the duplication problem.
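To make the unique-id idea concrete, here is a minimal Python sketch of what such a deduplication pass could look like. It is not Flume-specific: the JSON envelope, the field names and the functions `tag_message` and `merge_dedup` are all hypothetical, and the set of seen ids is kept in memory, so a real run over large logfiles would need to bound or persist it.

```python
import json
import uuid
from typing import Iterable, Iterator


def tag_message(host: str, text: str) -> str:
    """Sender side: wrap a log line in an envelope carrying a unique id.

    The JSON layout here is an assumption made for illustration only.
    """
    return json.dumps({"id": uuid.uuid4().hex, "host": host, "msg": text})


def merge_dedup(*streams: Iterable[str]) -> Iterator[dict]:
    """Collector side: yield every record once, keyed on its id.

    Records come out in the order they are first encountered, not
    sorted by time.
    """
    seen = set()
    for stream in streams:
        for line in stream:
            record = json.loads(line)
            if record["id"] not in seen:
                seen.add(record["id"])
                yield record
```

Feeding the logfiles from both logservers into `merge_dedup` would yield each message once, regardless of whether it reached one server or both.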

Logstash to Logstash to ElasticSearch: I could install Logstash on the servers and have them deliver to a central server via AMQP, with the durability options turned on. However, for this to work I would need to use one of the clustering-capable AMQP implementations, or fan out the delivery just as in the Flume case. AMQP seems to be yet another moving part, with several implementations and no real guidance on what works best for this sort of setup. And I'm not entirely convinced that I could get actual end-to-end durability from Logstash to ElasticSearch, assuming servers crash in between. The fan-out solutions run into the deduplication problem again. The best solution, which would seem to handle all the cases, would be Beetle, which seems to provide high availability and deduplication via a Redis store. However, I haven't seen any guidance on how to set this up with Logstash, and Redis is one more moving part for something that shouldn't be terribly difficult.

Logstash to ElasticSearch: I could run Logstash on all the servers, keep all the filtering and processing rules on the servers themselves, and just have them log directly to a remote ElasticSearch server. I think this should give me reliable logging, and I can use the ElasticSearch clustering features to share the database transparently. However, I am not sure whether the setup actually survives Logstash restarts and intermittent network problems without duplicating messages in a failover case or similar. But this approach sounds pretty promising.

rsync: I could just rsync all the relevant log files to two different servers. The reliability aspect should be perfect here, as the files should be identical to the source files after a sync is done. However, doing an rsync several times per second doesn't sound fun. Also, I need the logs to be tamper-proof after they have been sent, so the rsyncs would need to be in append-only mode. And log rotations mess things up unless I'm careful.

rsyslog with RELP: I could set up rsyslog to send messages to two remote hosts via RELP and have a local queue to store the messages. There is the deduplication problem again, and RELP itself might also duplicate some messages. However, this would only handle the things that log via syslog.

None of these solutions seems terribly good, and they all still have many unknowns, so I am asking people who have set up reliable centralized logging: what are the best tools to achieve that goal?

I am the creator of LogZilla and we are just around the corner from releasing an Amazon EC2 Cloud solution of our software. I would love the opportunity to discuss your goals and the possibility to provide this solution for you. If you are interested, feel free to contact me.

Although I'm sure you could use rsyslog, we are using syslog-ng with TCP (you could also use TLS encryption and disk-based buffering to both secure the messages and help ensure their delivery).

Our test boxes are sending up to 3000 events per second without losing any - all on an Amazon EC2 micro box (mind you, that won't work in production mostly because of storage needs, but it is a testament to the work we've done).

For HA, it would be easier to use two destination log servers than to try to deduplicate them - then, just use a heartbeat between the two servers and fail over to the standby if the primary goes offline. You can still do the dedup if you like, but the former tends to be much simpler to implement and works very well.
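As a rough illustration of the heartbeat idea (not how LogZilla or any particular HA tool implements it), the following Python sketch picks the active destination from the age of the last heartbeat. The class name, the server names and the 5 second timeout are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailoverSelector:
    """Choose between a primary and a standby log server."""

    primary: str
    standby: str
    timeout: float = 5.0  # seconds without a heartbeat before failing over (assumed value)
    last_heartbeat: float = field(default_factory=time.monotonic)

    def heartbeat(self, now: Optional[float] = None) -> None:
        """Record that the primary was just seen alive."""
        self.last_heartbeat = time.monotonic() if now is None else now

    def active(self, now: Optional[float] = None) -> str:
        """Return the primary while its heartbeat is fresh, else the standby."""
        now = time.monotonic() if now is None else now
        if now - self.last_heartbeat <= self.timeout:
            return self.primary
        return self.standby
```

The `now` parameter exists only so the behaviour can be exercised without waiting; in real use both methods would be called without it. In practice the failover is usually done by moving an IP address or similar rather than in the sending application, so treat this purely as a description of the decision logic.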

Syncing non-syslog files is a simple matter of parsing them with Perl and sending them over syslog using Log::Syslog::Fast - there is an example of this included in the contrib directory of our software (check out the svn if you want a copy). You could also just copy them up to the LogZilla server and pipe them directly into our pre-processor.
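The contrib example mentioned above is in Perl. For readers who want to see the general shape of the approach, here is a comparable sketch in Python rather than Perl. It is not the LogZilla script: the function names are hypothetical, the facility/severity defaults (local0/info) are arbitrary, and it assumes the receiver accepts newline-delimited BSD-style (RFC 3164) messages over plain TCP.

```python
import socket
from datetime import datetime

# Fixed month names so the timestamp does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_rfc3164(line, host, tag, when, facility=16, severity=6):
    """Wrap one log line in a BSD-syslog header.

    facility=16 is local0 and severity=6 is info; both are assumed defaults.
    """
    pri = facility * 8 + severity
    # RFC 3164 pads single-digit days with a space, not a zero.
    stamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{pri}>{stamp} {host} {tag}: {line.rstrip()}"


def forward_file(path, dest, host, tag):
    """Send every non-empty line of a file to a syslog server over TCP.

    dest is a (hostname, port) tuple. Lines are stamped with the time of
    sending, since the file's own timestamps are not parsed here.
    """
    with open(path, encoding="utf-8", errors="replace") as src, \
            socket.create_connection(dest) as sock:
        for line in src:
            if line.strip():
                msg = format_rfc3164(line, host, tag, datetime.now())
                sock.sendall(msg.encode("utf-8") + b"\n")
```

This reads the file once from start to end; following a growing file and surviving log rotation would need extra bookkeeping (remembering the offset and inode) that is left out here.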
