Elasticsearch vs. MongoDB for a filtering application [closed]

This question is about making an architectural choice prior to delving into the details of experimentation and implementation. It's about the suitability, in terms of scalability and performance, of Elasticsearch vs. MongoDB for a somewhat specific purpose.

Hypothetically, both store data objects that have fields and values, and both allow querying that body of objects. So presumably, filtering out subsets of the objects according to fields selected ad hoc is something both are fit for.

My application will revolve around selecting objects according to criteria. It would select objects by filtering on more than a single field simultaneously; put differently, its query filtering criteria would typically comprise anywhere between 1 and 5 fields, maybe more in some cases, whereas the fields chosen as filters would be a subset of a much larger number of fields. Picture some 20 field names existing, with each query attempting to filter the objects by a few fields out of those overall 20 (there could be fewer or more than 20 field names overall; I just used this number to demonstrate the ratio of existing fields to fields used as filters in every discrete query). The filtering can be by the existence of the chosen fields, as well as by the field values, e.g. filtering out objects that have field A, whose field B is between x and y, and whose field C is equal to w.
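
To make that concrete, here is a rough sketch of what such a multi-field filter might look like in each store's query language, written as Python dicts. The field names fieldA/fieldB/fieldC, the values, and the client calls in the comments are hypothetical placeholders, not an actual schema, and the Elasticsearch form assumes a reasonably recent version of its query DSL:

```python
# MongoDB filter document: field A exists, B between x and y, C equals w.
mongo_filter = {
    "fieldA": {"$exists": True},
    "fieldB": {"$gte": 10, "$lte": 20},   # x = 10, y = 20
    "fieldC": "w",
}
# e.g. with pymongo: db.objects.find(mongo_filter)

# Roughly equivalent Elasticsearch query using a bool filter (no scoring needed).
es_query = {
    "query": {
        "bool": {
            "filter": [
                {"exists": {"field": "fieldA"}},
                {"range": {"fieldB": {"gte": 10, "lte": 20}}},
                {"term": {"fieldC": "w"}},
            ]
        }
    }
}
# e.g. with elasticsearch-py: es.search(index="objects", body=es_query)
```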

My application will be continuously doing this sort of filtering, and there would be nothing, or very little, constant in terms of which fields are used for the filtering at any moment. Perhaps in Elasticsearch indexes need to be defined, but maybe even without indexes its speed is on par with that of MongoDB.

As for the data getting into the store, there are no special details about that: the objects would almost never be changed after having been inserted. Perhaps old objects would need to be dropped; I'd like to assume both data stores support expiring and deleting data internally or via an application-made query. (Less frequently, objects that fit a certain query would need to be dropped as well.)

What do you think? And have you experimented with this aspect?

I am interested in the performance and scalability of each of the two data stores for this kind of task. This is the sort of architectural design question where details of the store-specific options or query cornerstones that would make it well architected are welcome, as a demonstration of a fully thought-out suggestion.

Thanks!


Solution 1:

First off, there is an important distinction to make here: MongoDB is a general-purpose database, while Elasticsearch is a distributed text search engine backed by Lucene. People have been talking about using Elasticsearch as a general-purpose database, but know that it was not its original design. I think general-purpose NoSQL databases and search engines are headed for consolidation, but as it stands, the two come from two very different camps.

We are using both MongoDB and Elasticsearch in my company. We store our data in MongoDB and use Elasticsearch exclusively for its full-text search capabilities. We only send to Elastic the subset of the Mongo data fields that we need to query. Our use case differs from yours in that our Mongo data changes all the time: a record, or a subset of the fields of a record, can be updated several times a day, and this can call for re-indexing of that record to Elastic. For that reason alone, using Elastic as the sole data store is not a good option for us, as we can't update select fields; we would need to re-index a document in its entirety. This is not an Elastic limitation; this is how Lucene, the underlying search engine behind Elastic, works. In your case, the fact that records won't be changed once stored saves you from having to make that choice. Having said that, if data safety is a concern, I would think twice about using Elasticsearch as the only storage mechanism for your data. It may get there at some point, but I'm not sure it's there yet.
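
To illustrate that dual-store pattern (Mongo as the source of truth, Elastic holding only the queryable subset), here is a minimal sketch assuming the official pymongo and elasticsearch-py Python clients; the collection, index, and field names are made up, and newer elasticsearch-py versions take document= where older ones take body=:

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo = MongoClient()                              # source of truth
es = Elasticsearch("http://localhost:9200")        # search index only

def index_record(doc):
    """Push only the fields we query on into Elasticsearch.

    Because Lucene cannot update individual fields in place, any change
    to one of these fields means re-indexing the whole document.
    """
    searchable = {k: doc[k] for k in ("title", "status", "price") if k in doc}
    es.index(index="records", id=str(doc["_id"]), document=searchable)

# Insert into Mongo first, then mirror the searchable subset into Elastic.
doc = {"title": "example", "status": "active", "price": 9.99, "internal_notes": "..."}
result = mongo.mydb.records.insert_one(doc)
doc["_id"] = result.inserted_id
index_record(doc)
```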

In terms of speed, not only is Elastic/Lucene on par with Mongo's querying speed; in your case, where there is "very little constant in terms of which fields are used for the filtering at any moment", it could be orders of magnitude faster, especially as the datasets grow larger. The difference lies in the underlying query implementations:

  • Elastic/Lucene uses the Vector Space Model and inverted indexes for information retrieval, which are highly efficient ways of comparing record similarity against a query. When you query Elastic/Lucene, it already knows the answer; most of its work lies in ranking the results for you by how likely they are to match your query terms. This is an important point: search engines, as opposed to databases, can't guarantee you exact results; they rank results by how close they get to your query. It just so happens that most of the time, the results are close to exact.
  • Mongo's approach is that of a more general-purpose data store; it compares JSON documents against one another. You can get great performance out of it by all means, but you need to carefully craft your indexes to match the queries you will be running. Specifically, if you have multiple fields by which you will query, you need to carefully craft your compound keys so that they reduce the dataset to be queried as quickly as possible. E.g. your first key should filter down the majority of your dataset, your second should further filter down what's left, and so on and so forth (see the sketch after this list). If your queries don't match the keys, and the order of those keys, in the defined indexes, your performance will drop quite a bit. On the other hand, Mongo is a true database, so if accuracy is what you need, the answers it gives will be spot on.
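
As a rough illustration of the compound-key point, a minimal pymongo sketch (hypothetical field names, not your actual schema):

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

coll = MongoClient().mydb.objects

# Compound index: the most selective field first, then progressively
# narrower filters. Queries benefit most when they filter on a prefix
# of this key order (country, then category, then created_at).
coll.create_index([
    ("country", ASCENDING),
    ("category", ASCENDING),
    ("created_at", DESCENDING),
])

# Matches the index prefix, so Mongo can narrow the scan efficiently.
coll.find({"country": "US", "category": "books"})

# Skips the leading key ("country"), so this query cannot use the
# index prefix and may fall back to scanning far more documents.
coll.find({"category": "books"})
```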

For expiring old records, Elastic has a built-in TTL feature. Mongo just introduced TTL indexes as of version 2.2, I think.
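
To show what the Mongo side of that looks like, a minimal sketch of a TTL index via pymongo (the collection and field names are made up; the indexed field must hold a BSON date for TTL expiry to work):

```python
from pymongo import MongoClient, ASCENDING

coll = MongoClient().mydb.objects

# A background task removes documents once their "created_at" value
# is older than expireAfterSeconds (here, 24 hours).
coll.create_index([("created_at", ASCENDING)], expireAfterSeconds=86400)
```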

Since I don't know your other requirements such as expected data size, transactions, accuracy or what your filters will look like, it's hard to make any specific recommendations. Hopefully, there is enough here to get you started.