Refresh vs flush

Solution 1:

The answer that you got is correct but I think it's worth to elaborate a bit more.

A refresh effectively calls a reopen on the lucene index reader, so that the point in time snapshot of the data that you can search on gets updated. This lucene feature is part of the lucene near real-time api.

An elasticsearch refresh makes your documents available for search, but it doesn't make sure that they are written to disk to a persistent storage, as it doesn't call fsync, thus doesn't guarantee durability. What makes your data durable is a lucene commit, which is way more expensive.

While you can call lucene reopen every second, you cannot do the same with lucene commit.

Through lucene you can then have new documents available for search in near real-time by calling reopen pretty often, but you still need to call commit to ensure data is written to disk and fsynced, thus safe.

Elasticsearch solves this "problem" by adding a transaction log per shard (effectively a lucene index), where write operations that have not been committed yet are stored. The transaction log is fsynced and safe, thus you obtain durability at any point in time, even for documents that have not been committed yet. You can search on documents in near real-time as refresh happens automatically every second, and you can also be sure that if something bad happens the transaction log can be replayed to restore eventually lost documents. The nice thing about the transaction log is that it can be used internally for other things, for instance to provide real-time get by id.

An elasticsearch flush effectively triggers a lucene commit, and empties also the transaction log, since once data is committed on the lucene level, durability can be guaranteed by lucene itself. Flush is exposed as an api too and can be tweaked, although usually that is not necessary. Flush happens automatically depending on how many operations get added to the transaction log, how big they are, and when the last flush happened.

Solution 2:

A refresh causes a new segment to be written, so it becomes available for search.

A flush causes a Lucene commit to happen. This is a lot more expensive.

For more details, I've written an article that covers some of this: Elasticsearch from the bottom up :)

Solution 3:

  1. refresh: transform in-memory buffer to in-memory segment which can be searched.
  2. flush: (a) merge small segments to be a big segment (b) fsync the big segment to disk (c) empty translog.

Refresh:

refresh

Flush:

flush

Segment is a part of lucene. Immutable segments make OS page caches always clean.

Translog is a part of Elasticsearch. Translog is aim for durability.

Reference:

  1. Guide to Refresh and Flush Operations in Elasticsearch
  2. Official Doc for make doc persistent