Which NoSQL database should I use for logging? [closed]
Do you have any experience logging to NoSQL databases for scalable apps? I have done some research on NoSQL databases for logging and found that MongoDB seems to be a good choice. Also, I found log4mongo-net which seems to be a very straightforward option.
Would you recommend this kind of approach? Are there any other suggestions?
I've decided to revise this accepted answer as the state of the art has moved significantly in the last 18 months, and much better alternatives exist.
New Answer
MongoDB is a sub-par choice for a scalable logging solution. There are the usual reasons for this (write performance under load for example). I'd like to put forward one more, which is that it only solves a single use case in a logging solution.
A strong logging solution needs to cover at least the following stages:
- Collection
- Transport
- Processing
- Storage
- Search
- Visualisation
MongoDB as a choice only solves the Storage use case (albeit somewhat poorly). Once the complete chain is analysed, there are more appropriate solutions.
@KazukiOhta mentions a few options. My preferred end to end solution these days involves:
- Logstash-Forwarder for Collection & Transport
- Logstash & Riemann for Processing
- ElasticSearch for Storage & Queries
- Kibana3 for Visualisation
For the logging and searching use case, ElasticSearch is the current best-of-breed NoSQL store for log data. The fact that Logstash-Forwarder, Logstash, ElasticSearch and Kibana3 all sit under the ElasticSearch umbrella makes for an even more compelling argument.
Since Logstash can also act as a Graphite proxy, a very similar chain can be built for the associated problem of collecting and analysing metrics (not just logs).
Old Answer
MongoDB Capped Collections are extremely popular and suitable for logging, with the added bonus of being 'schema less', which is usually a semantic fit for logging. Often we only know what we want to log well into a project, or after certain issues have been found in production. Relational databases or strict schemas tend to be difficult to change in these cases, and attempts to make them 'flexible' tend just to make them 'slow' and difficult to use or understand.
But if you'd like to manage your logs in the dark and have lasers going and make it look like you're from space, there's always Graylog2, which uses MongoDB as part of its overall infrastructure but provides a whole lot more on top, such as a common, extensible format, a dedicated log collection server, a distributed architecture and a funky UI.
I've seen a lot of companies using MongoDB to store application logs. Its schema-freeness is a good fit for application logs, whose schema tends to change from time to time. Its Capped Collection feature is also really useful because it automatically purges the oldest data, so the collection stays within a fixed size that can be chosen to fit into memory.
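As a rough illustration of the capped collection idea, here is a minimal Python sketch. It assumes a PyMongo-style database object; the helper name `ensure_capped_log`, the collection name `app_log` and the 256 MB size are made up for the example, not taken from any of the tools mentioned here.

```python
from typing import Any


def ensure_capped_log(db: Any, name: str = "app_log",
                      size_bytes: int = 256 * 1024 * 1024) -> bool:
    """Create `name` as a capped collection unless it already exists.

    `db` is assumed to behave like a pymongo Database. Returns True if
    the collection was created, False if it was already there.
    """
    if name in db.list_collection_names():
        return False
    # capped=True makes MongoDB overwrite the oldest documents once the
    # collection reaches size_bytes, so old logs are purged automatically.
    db.create_collection(name, capped=True, size=size_bytes)
    return True


# Usage (needs a running MongoDB and the pymongo package):
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://localhost:27017")["logging_demo"]
#   ensure_capped_log(db)
#   db["app_log"].insert_one({"level": "INFO", "msg": "started"})
```

Picking the size up front is the main design decision: too small and you lose logs you still need, too large and the working set may no longer fit in RAM.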
People aggregate the logs with normal grouping or MapReduce, but it's not that fast. In particular, MongoDB's MapReduce runs in a single thread and its JavaScript execution overhead is huge. The new aggregation framework could solve this problem.
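For what it's worth, an aggregation framework query over logs might look roughly like the sketch below. The field names `ts` and `level` and the function name are assumptions about how the log documents are shaped, not something MongoDB prescribes.

```python
from datetime import datetime
from typing import Any, Dict, List


def count_by_level_pipeline(since: datetime) -> List[Dict[str, Any]]:
    """Build a pipeline that counts log entries per level since `since`.

    Assumes each log document has a `ts` timestamp and a `level` field.
    """
    return [
        # Keep only recent entries (an index on `ts` would help here).
        {"$match": {"ts": {"$gte": since}}},
        # One output document per log level, with the number of entries.
        {"$group": {"_id": "$level", "count": {"$sum": 1}}},
        # Most frequent levels first.
        {"$sort": {"count": -1}},
    ]


# Usage with pymongo (needs a running MongoDB):
#   for row in db["app_log"].aggregate(count_by_level_pipeline(datetime(2013, 1, 1))):
#       print(row["_id"], row["count"])
```

The pipeline runs natively inside the server rather than through the JavaScript engine, which is why it is expected to avoid the MapReduce overhead.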
When you use MongoDB for logging, the main concern is lock contention caused by high write throughput. Although MongoDB's insert is fire-and-forget by default, calling insert() many times causes heavy write lock contention. This can hurt application performance and prevent readers from aggregating or filtering the stored logs.
One solution is to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are meant to run on every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems such as MongoDB or PostgreSQL. Writes are done in batches, so this is a lot more efficient than writing directly from the apps. This link describes how to send logs to Fluentd from a PHP program.
- Fluentd: Data Import from PHP Applications
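To show why batching helps, here is a toy Python sketch of the buffer-then-write-in-batches idea. It is not how Fluentd, Logstash or Flume are actually implemented; the class name `BatchingLogBuffer` and the `sink` callback are hypothetical, and the flush happens synchronously in the caller rather than in a background daemon.

```python
import threading
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class BatchingLogBuffer:
    """Collect log records and hand them to `sink` in batches."""

    def __init__(self, sink: Callable[[List[Record]], None],
                 max_batch: int = 100) -> None:
        self._sink = sink            # e.g. a function wrapping insert_many()
        self._max_batch = max_batch
        self._pending: List[Record] = []
        self._lock = threading.Lock()

    def append(self, record: Record) -> None:
        """Buffer one record; write a batch once max_batch is reached."""
        with self._lock:
            self._pending.append(record)
            if len(self._pending) < self._max_batch:
                return
            batch, self._pending = self._pending, []
        # Call the sink outside the lock so other writers are not blocked.
        self._sink(batch)

    def flush(self) -> None:
        """Write out whatever is still buffered (e.g. at shutdown)."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._sink(batch)


# Usage: the sink here just collects batches; a real one might call
# db["app_log"].insert_many(batch) via pymongo.
batches: List[List[Record]] = []
buf = BatchingLogBuffer(batches.append, max_batch=3)
for i in range(7):
    buf.append({"n": i})
buf.flush()
```

With `max_batch=3` and seven records, the store sees three writes instead of seven, which is the whole point: fewer, larger writes mean less lock contention on the database side.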
Here are some tutorials on MongoDB + Fluentd.
- Fluentd + MongoDB: The Easiest Way to Log Your Data Effectively on 10gen blog
- Fluentd: Store Apache Logs into MongoDB
MongoDB's problem is that it starts slowing down when the data volume exceeds the memory size. At that point, you can switch to other solutions like Apache Hadoop or Cassandra. If you have the distributed logging layer mentioned above, you can quickly switch to another solution as you grow. This tutorial describes how to store logs in HDFS using Fluentd.
- Fluentd: Fluentd + HDFS: Instant Big Data Collection