Is anybody using Splunk in a large-scale production environment? [closed]

I've been watching the videos at splunk.com, and it's hard to believe that one can get all those features for free. There's still that "where's the catch?" in the back of my head.

So it'd be great if anybody who is actually using Splunk in production would share their experiences, perhaps highlighting its benefits over, say, Nagios?

Thanks much in advance.


Solution 1:

We're using it for 7+GB of data per day, but we pay for that. A lot. I think we get a bit of an academic discount, but mostly we managed to justify spending the money because it satisfied auditors about having somebody/something looking over our logs.

We also use nagios. We've configured Splunk with some saved searches that call scripts that either generate nagios alerts or create RT tickets. So, for instance, more than X login failures in a 5-minute time window (across all servers) will generate an alert. That's the kind of thing nagios can't really do on its own.
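To make the "over X login failures in a 5 minute window" rule concrete, here is a minimal Python sketch of the counting logic only. It is not how Splunk implements saved searches and not the script we actually run; the function name `find_alert_times` and its parameters are made up for illustration, and it assumes the failure timestamps have already been extracted from the logs of all servers.

```python
from collections import deque
from datetime import datetime, timedelta


def find_alert_times(failure_times, threshold, window=timedelta(minutes=5)):
    """Return the timestamps at which more than `threshold` login failures
    fall inside the trailing `window`.

    `failure_times` is any iterable of datetime objects, pooled from all
    servers. This is a hypothetical helper, not a Splunk or nagios API.
    """
    recent = deque()  # failures still inside the trailing window
    alerts = []
    for ts in sorted(failure_times):
        recent.append(ts)
        # Drop failures that are older than the window relative to ts.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) > threshold:
            alerts.append(ts)
    return alerts


# Example data: six failures 30 seconds apart, then one much later.
start = datetime(2024, 1, 1, 12, 0, 0)
burst = [start + timedelta(seconds=30 * i) for i in range(6)]
late = [start + timedelta(hours=1)]
alert_times = find_alert_times(burst + late, threshold=5)
```

In a setup like ours, each entry in `alert_times` would be the point where a script raises a nagios alert or opens an RT ticket.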

Previously we were using SEC to generate those kinds of alerts, but it didn't work as well and somebody still had to try to use grep on a 20GB file now and then.

I'm not sure we have any nagios alerts generated anymore; we've switched most, if not all, of that to generating RT tickets. The nagios alert model doesn't really work well for stuff based on log analysis; it's better at things with a state that can be good or bad, not a discrete event that may need investigating.

EDIT:

Yes, it really does make life a lot easier for us. It's substantially better than trying to grep through logs. We've got Windows, Linux and Solaris boxes sending it logs.

Does it magically find exactly what you want like some of the videos imply? No, it's got some limitations and you may have to do a bit of configuration to get it handling specific types of logs well. And overly "interesting" searches can require reading through the docs and then waiting a few minutes while the splunk server churns. But, seriously, it rocks. From what I've seen, there's really nothing else in its league.

Solution 2:

I've worked with both Splunk and Nagios, and they serve two distinct purposes.

Splunk does make searching through logs much simpler and easier to do. Having saved searches for common problems can be invaluable in identifying problems. I have 2 Splunk servers in different locations. Both use the free edition, as the pricing was out of range and the daily indexed amount is not enough to require purchasing more.

Nagios, on the other hand, makes for a great active monitoring platform. I have a 5-server distributed Nagios platform monitoring multiple geographical locations. It is very different from Splunk, which monitors logfiles: Nagios can have service check plugins written to actively monitor just about anything and notify you of problems so you can resolve them.
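For anyone who hasn't written a service check plugin, here is a rough Python sketch of what one can look like. It follows the standard Nagios plugin convention of one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The disk-usage check, the function names and the 80/90 percent thresholds are illustrative assumptions, not one of my actual plugins.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used, warn, crit):
    """Map a usage percentage to a (exit_code, label) pair."""
    if percent_used >= crit:
        return CRITICAL, "CRITICAL"
    if percent_used >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for disk usage at `path`.

    The thresholds are example values chosen for this sketch.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything that stops the check from running is reported as UNKNOWN.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    percent_used = 100.0 * usage.used / usage.total
    code, label = classify(percent_used, warn, crit)
    return code, f"DISK {label} - {percent_used:.1f}% used on {path}"


def main():
    """Print the status line and return the exit code for Nagios."""
    code, message = check_disk("/")
    print(message)
    return code
```

To use it as a real plugin you would end the script with `import sys` and `sys.exit(main())`, so Nagios sees the exit code, and then point a command definition at the script.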

I find the two together give a much better picture and do help in maintaining a network, especially if it is a team rather than an individual effort. Everyone involved is able to see the same picture.