Log transport and aggregation at scale
How're you analysing log files from UNIX/Linux machines? We run several hundred servers which all generate their own log files, either directly or through syslog. I'm looking for a decent solution to aggregate these and pick out important events. This problem breaks down into 3 components:
1) Message transport
The classic way is to use syslog to log messages to a remote host. This works fine for applications that log to syslog, but it's less useful for apps that write to a local file. Solutions might include having the application log to a FIFO connected to a program that forwards the messages via syslog, or writing something to scrape the local files and send new entries to the central syslog host. However, if we're going to the trouble of writing tools to get messages into syslog, would we be better off replacing the whole lot with something like Facebook's Scribe, which offers more flexibility and reliability than syslog?
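For illustration, here's a minimal sketch of that second idea in Python: follow a local file and relay each new line to the central syslog host. The hostname, port and file path are placeholders for your environment, and it deliberately ignores log rotation:

```python
#!/usr/bin/env python
# Sketch: tail a local log file and relay new lines to a central syslog host.
import time
import logging
import logging.handlers

CENTRAL_HOST = ("loghost.example.com", 514)  # placeholder central syslog server
LOGFILE = "/var/log/myapp/app.log"           # placeholder app log that bypasses syslog

logger = logging.getLogger("relay")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.SysLogHandler(
    address=CENTRAL_HOST,
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0))

def follow(path):
    """Yield lines appended to path, like `tail -f`. No rotation handling."""
    with open(path) as f:
        f.seek(0, 2)                # start at the current end of file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(0.5)     # wait for more output

for line in follow(LOGFILE):
    logger.info(line)               # one syslog datagram per log line
```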
2) Message aggregation
Log entries seem to fall into one of two types: per-host and per-service. Per-host messages are those which occur on one machine; think disk failures or suspicious logins. Per-service messages occur on most or all of the hosts running a service. For instance, we want to know when Apache finds an SSI error but we don't want the same error from 100 machines. In all cases we only want to see one of each type of message: we don't want 10 messages saying the same disk has failed, and we don't want a message each time a broken SSI is hit.
One approach to solving this is to aggregate multiple messages of the same type into one on each host, send those to a central server, and then aggregate messages of the same kind there into one overall event. SEC (the Simple Event Correlator) can do this, but it's awkward to use: even after a couple of days of fiddling I had only rudimentary aggregations working and had to constantly look up the logic SEC uses to correlate events. It's powerful but tricky stuff, and I need something my colleagues can pick up and use in the shortest possible time. SEC rules don't meet that requirement.
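To make the idea concrete, here's a minimal sketch of the kind of aggregation I mean, in Python. The signature rules and suppression window are arbitrary examples, not anything standard; for per-host events you'd key on (host, signature) instead of the message alone:

```python
#!/usr/bin/env python
# Sketch: reduce each message to a "signature" (its type, with variable
# fields masked) and only emit an event the first time a signature is
# seen within a suppression window.
import re
import time

SUPPRESS_SECONDS = 3600   # at most one event per signature per hour (tunable)
last_seen = {}            # signature -> time we last emitted an event for it

def signature(message):
    """Collapse variable fields so equivalent messages compare equal."""
    msg = re.sub(r"/[\w./-]+", "PATH", message)  # mask file paths
    msg = re.sub(r"\d+", "N", msg)               # mask numbers (PIDs, offsets)
    return msg   # per-service: no hostname in the key, so 100 hosts -> 1 event

def should_alert(message, now=None):
    now = time.time() if now is None else now
    sig = signature(message)
    if now - last_seen.get(sig, 0) >= SUPPRESS_SECONDS:
        last_seen[sig] = now
        return True
    return False

# Example: the same SSI error arriving from 100 web servers yields one alert,
# because the signature ignores which host (and which page) produced it.
for i in range(100):
    msg = "[error] mod_include: invalid SSI directive in /var/www/page%d.shtml" % i
    if should_alert(msg):
        print("ALERT: " + signature(msg))   # fires exactly once
```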
3) Generating alerts
How do we tell our admins when something interesting happens? Mail the group inbox? Inject into Nagios?
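For the Nagios route, one concrete mechanism is a passive service check result written to Nagios's external command file. A minimal sketch, assuming a Debian-style command-file path and a pre-defined passive service called log-events (both placeholders):

```python
#!/usr/bin/env python
# Sketch: inject a log-derived event into Nagios as a passive check result.
import time

CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"  # placeholder; see command_file in nagios.cfg

def submit_passive_check(host, service, state, message):
    """state: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, state, message)
    with open(CMD_FILE, "w") as f:   # the command file is a named pipe
        f.write(line)

# Example: flag a disk error spotted in a log as CRITICAL on the source host.
submit_passive_check("web001", "log-events", 2,
                     "disk failure reported in /var/log/messages")
```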
So, how're you solving this problem? I don't expect an answer on a plate; I can work out the details myself but some high-level discussion on what is surely a common problem would be great. At the moment we're using a mishmash of cron jobs, syslog and who knows what else to find events. This isn't extensible, maintainable or flexible and as such we miss a lot of stuff we shouldn't.
Updated: we're already using Nagios for monitoring, which is great for detecting down hosts, testing services and so on, but less useful for scraping log files. I know there are log plugins for Nagios, but I'm interested in something more scalable and hierarchical than per-host alerts.
Solution 1:
I've used three different systems for centralizing logs:
- Syslog/syslog-ng forwarding to one host
- Zenoss for event aggregation and alerting
- Splunk for log aggregation and search
For the third option (Splunk), I typically use syslog-ng to forward the messages from each host directly into Splunk. It can also parse log files directly, but that can be a bit of a pain.
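The per-host forwarding config can be as small as something like this; the hostname and port are placeholders, and Splunk would need a matching TCP network input on the other end:

```
# Sketch of a per-host syslog-ng relay (paths/ports are placeholders).
source s_local { unix-stream("/dev/log"); internal(); };
destination d_splunk { tcp("splunk.example.com" port(1514)); };
log { source(s_local); destination(d_splunk); };
```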
Splunk is pretty awesome for searching and categorizing your logs. I haven't used Splunk for log alerting, but I think it's possible.
Solution 2:
You can take a look at OSSEC, a complete open-source HIDS. It does log analysis and can trigger actions or send mail on alerts. Alerts are triggered by a set of simple XML-based rules; a lot of pre-defined ones for various log formats are included, and you can add your own.
http://www.ossec.net/
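For a flavour of the rules, this is roughly what a local rule looks like (ids 100000-119999 are reserved for user rules; the match string and level here are made up for illustration). Alerts at or above the configured email_alert_level get mailed:

```xml
<!-- Sketch of a local rule, e.g. in rules/local_rules.xml. -->
<group name="local,syslog">
  <rule id="100100" level="10">
    <match>I/O error, dev sd</match>
    <description>Kernel reports a disk I/O error.</description>
  </rule>
</group>
```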
Solution 3:
Take a look at Octopussy. It's fully customizable and seems to meet all your needs...
PS: I'm the developer of this solution.