Scaling Logstash (with Redis/Elasticsearch)

Across a cluster of more than 12 CentOS 5.8 servers, I deployed Logstash using the native Logstash shipper, which sends /var/log/*/*.log back to a central Logstash server.

We had tried using rsyslogd as the shipper, but due to a bug in rsyslogd's imfile module, if the remote end didn't reply, the logs would pile up in memory.

We're currently using Redis as the transport mechanism, so logstash01 has Redis running locally, bound to the IP of the VLAN that carries these logs.

So the Logstash shipper sends to Redis on logstash01, and logstash01 feeds Elasticsearch, which runs as a separate process.
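Concretely, the shipper and indexer configs look roughly like this (a simplified sketch: the VLAN IP, Redis key, and type names below are placeholders rather than our exact settings):

# shipper.conf - runs on each web server (sketch; host/key values are placeholders)
input {
  file {
    type => "syslog"
    path => "/var/log/*/*.log"
  }
}
output {
  redis {
    host => "10.1.1.10"      # logstash01's VLAN IP (placeholder)
    data_type => "list"
    key => "logstash"
  }
}

# indexer.conf - runs on logstash01 (sketch)
input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
    type => "redis-input"
  }
}
output {
  elasticsearch {
    host => "127.0.0.1"      # the separate Elasticsearch process
  }
}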

Here's what we're seeing: Elasticsearch has 141 blocked threads, and stracing the Elasticsearch parent process shows:

futex(0x7f4ccd1939d0, FUTEX_WAIT, 26374, NULL

Here's the jstack from Elasticsearch.

Here's the jstack from Logstash.

So... last night, some of the web servers (whose logs are tailed by Logstash) went nuts, with load averages over 500.

On logstash01, there's this in the kernel log:

Dec 19 00:44:45 logstash01 kernel: [736965.925863] Killed process 23429 (redis-server) total-vm:5493112kB, anon-rss:4248840kB, file-rss:108kB

So the OOM killer killed redis-server, which meant logs piled up in memory on the servers that were shipping them... which somehow got Apache's knickers in a twist. (Frankly, I'm not sure how; I just assume it's related to tailing the log.)

This is my theory of how events unfolded:

  1. We had a traffic spike.
  2. An immense volume of logs was generated.
  3. These piled up in Redis, as Logstash/Elasticsearch only seems to be able to handle 300-400 new events per second.
  4. Redis filled up entirely, to the point where the OOM killer slaughtered it senselessly (see the capped-Redis sketch just after this list).
  5. Redis stopped accepting new items.
  6. Items then started to pile up on the remote hosts' side.
  7. Everything went nuts, and Apache stopped accepting requests. (Why?)
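On step 4: Redis is currently running with no memory cap at all, so it just grows until the kernel kills it. My understanding is that something like the following redis.conf settings (the 2gb figure is a guess for this 8 GB box, not what we run today) would at least make Redis refuse writes instead of getting OOM-killed, though I'm not sure whether that just moves the problem upstream:

# redis.conf sketch - values are guesses, not our current settings
# Cap Redis's memory use well below physical RAM.
maxmemory 2gb
# With noeviction, Redis returns errors on writes once the cap is hit
# rather than growing until the OOM killer takes it out.
maxmemory-policy noeviction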

My questions are these:

  1. Why does Apache go nuts if there's just something tailing its log? Is it that the thing tailing it blocks Apache from writing?

  2. Is there a sane way to make Elasticsearch faster/better/more resilient?

  3. Is there a sane way to make Redis resilient, so it doesn't die just because it got OOM-killed?

  4. Is there a fundamental flaw in the way I've set it all up, or does everyone have this problem?

-- EDIT --

Some specs for @lusis.

admin@log01:/etc/init$ free -m
             total       used       free     shared    buffers     cached
Mem:          7986       6041       1944          0        743       1157
-/+ buffers/cache:       4140       3845
Swap:         3813       3628        185

Filesystem             Size  Used Avail Use% Mounted on
/dev/sda2               19G  5.3G   13G  31% /
udev                   3.9G  4.0K  3.9G   1% /dev
tmpfs                  1.6G  240K  1.6G   1% /run
none                   5.0M     0  5.0M   0% /run/lock
none                   3.9G     0  3.9G   0% /run/shm
/dev/sda1               90M   72M   14M  85% /boot
/dev/mapper/data-disk  471G  1.2G  469G   1% /data

/dev/sda2 on / type ext3 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/data-disk on /data type ext3 (rw)
/data/elasticsearch on /var/lib/elasticsearch type none (rw,bind)

log01:/etc/init$ top 
top - 14:12:20 up 18 days, 21:59,  2 users,  load average: 0.20, 0.35, 0.40
Tasks: 103 total,   1 running, 102 sleeping,   0 stopped,   0 zombie
Cpu0  :  3.0%us,  1.0%sy,  0.0%ni, 95.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  : 12.0%us,  1.0%sy,  0.0%ni, 86.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  :  4.7%us,  0.3%sy,  0.0%ni, 94.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  5.6%us,  1.3%sy,  0.0%ni, 93.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  5.3%us,  1.3%sy,  0.0%ni, 93.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  6.4%us,  1.0%sy,  0.0%ni, 92.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   8178120k total,  6159036k used,  2019084k free,   761780k buffers

-- ANSWER (from @lusis) --

Your post doesn't describe much in the way of specs (memory on the LS indexer, log volume, or much else), but I'll try to answer your questions as best I can. Disclaimer: I'm one of the Logstash devs.

  1. Apache going nuts was likely a side effect of the Logstash process acting up. I'd put that aside for now.

  2. The sane way to make ES faster/better/more resilient is to add more ES nodes. It's seriously that easy. They even autodiscover each other, depending on network topology. After 17 years in this industry, I've never seen anything scale horizontally as easily as Elasticsearch.

  3. To do the same for Redis, don't use any Redis clustering. Newer versions of Logstash can do Redis load balancing internally. The Redis output supports a list of Redis hosts in the plugin config, and support is about to be added to the input side to match. In the interim, you can use multiple Redis input definitions on the indexer/consumer side (see the sketch just after this list).

  4. I can't answer this beyond saying that it sounds like you're trying to do too much with a single (possibly underpowered) host.
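To illustrate point 3, multiple Redis input definitions on the indexer side look roughly like this (a sketch with placeholder hostnames; adjust the key and type to match your setup):

# indexer sketch: pull from more than one Redis broker until input-side
# load balancing lands (hostnames are placeholders)
input {
  redis {
    host => "redis01.example.com"
    data_type => "list"
    key => "logstash"
    type => "redis-input"
  }
  redis {
    host => "redis02.example.com"
    data_type => "list"
    key => "logstash"
    type => "redis-input"
  }
}

Each input block runs in its own thread, so this also buys you a bit of consumer parallelism.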

Any good scaling process starts with breaking colocated components out into distinct systems. I don't see your configs gisted anywhere, but the place where Logstash 'bottlenecks' is in its filters. Depending on how many transformations you're doing, they can have an impact on the memory usage of the Logstash processes.
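As a generic example (not taken from your configs), even a single grok filter like the one below runs a regex against every event that passes through, and that per-event work is where indexer CPU and memory tend to go; each additional filter multiplies it:

# generic filter sketch - every event pays the regex cost here;
# stacking more filters (date, mutate, etc.) adds more per-event work
filter {
  grok {
    type => "apache"
    pattern => "%{COMBINEDAPACHELOG}"
  }
}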

Logstash works a lot like Lego. You can either use a 2x4 brick or two 2x2 bricks to accomplish the same task. Except in Logstash's case, it's actually sturdier to use several smaller bricks than a single big brick.

Some general advice we normally give is:

  • Ship logs as quickly as possible from the edge. If you can use pure network transport instead of writing to disk, that's nice but not required. Logstash is JVM-based, and that has good and bad implications. Use an alternate shipper: I wrote a Python-based one ( https://github.com/lusis/logstash-shipper ), but I'd suggest folks use Beaver instead ( https://github.com/josegonzalez/beaver ).

  • Generate your logs in a format that requires as little filtering as possible (JSON, or optimally json-event format). This isn't always possible. I wrote a log4j appender to do this ( https://github.com/lusis/zmq-appender ) and eventually broke the pattern layout out into its own repo ( https://github.com/lusis/log4j-jsonevent-layout ). This means I don't have to do ANY filtering in Logstash for those logs; I just set the type on input to 'json-event' (see the input sketch after the Apache note below).

For Apache, you can try this approach: http://cookbook.logstash.net/recipes/apache-json-logs/
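As a sketch of what that looks like on the input side (option names as I recall them from the current docs, and the log path is hypothetical), tailing a file that's already written as JSON events means you need no filter section at all:

# sketch: the log file already contains serialized Logstash events,
# so nothing needs to be grokked downstream (the path is hypothetical)
input {
  file {
    type => "apache"
    path => "/var/log/apache2/access.json.log"
    format => "json_event"
  }
}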

  • Break things into multiple components. In every talk I've done about Logstash, I describe it as a Unix pipe on steroids. You can make the pipeline as long or as short as you like. You scale Logstash by shifting responsibilities around horizontally. This might mean making the pipeline longer, but we're not talking about anything statistically relevant in terms of latency overhead. If you have greater control over your network (i.e. NOT on EC2), you can do some amazing things with standard traffic isolation.

Also note that the Logstash mailing list is VERY active, so you should always start there. Sanitize and gist your configs, as that's the best way to get useful help.

There are companies (like Sonian) scaling Elasticsearch to petabyte levels, and companies (like Mailchimp and Dreamhost) scaling Logstash to insane levels as well. It can be done.