Is there a smarter way to reindex elasticsearch?

I ask because our search is in a state of flux as we work things out, but each time we make a change to the index (change tokenizer or filter, or number of shards/replicas), we have to blow away the entire index and re-index all our Rails models back into Elasticsearch ... this means we have to factor in downtime to re-index all our records.

Is there a smarter way to do this that I'm not aware of?

I think @karmi makes it right. However let me explain it a bit simpler. I needed to occasionally upgrade production schema with some new properties or analysis settings. I recently started to use the scenario described below to do live, constant load, zero-downtime index migrations. You can do that remotely.

Here are steps:

Assumptions:

You have index real1 and aliases real_write, real_read pointing to it,
the client writes only to real_write and reads only from real_read ,
_source property of document is available.

1. New index

Create real2 index with new mapping and settings of your choice.

2. Writer alias switch

Using following bulk query switch write alias.

curl -XPOST 'http://esserver:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "real1", "alias" : "real_write" } },
        { "add" : { "index" : "real2", "alias" : "real_write" } }
    ]
}'

This is atomic operation. From this time real2 is populated with new client's data on all nodes. Readers still use old real1 via real_read. This is eventual consistency.

3. Old data migration

Data must be migrated from real1 to real2, however new documents in real2 can't be overwritten with old entries. Migrating script should use bulk API with create operation (not index or update). I use simple Ruby script es-reindex which has nice E.T.A. status:

$ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2

UPDATE 2017 You may consider new Reindex API instead of using the script. It has lot of interesting features like conflicts reporting etc.

4. Reader alias switch

Now real2 is up to date and clients are writing to it, however they are still reading from real1. Let's update reader alias:

curl -XPOST 'http://esserver:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "real1", "alias" : "real_read" } },
        { "add" : { "index" : "real2", "alias" : "real_read" } }
    ]
}'

5. Backup and delete old index

Writes and reads go to real2. You can backup and delete real1 index from ES cluster.

Done!

Yes, there are smarter ways how to re-index your data without downtime.

First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.

Second, create an alias "alias" pointing to that index. Your application will then use this alias, so you'll never need to manually change the index name, restart the application, etc.

Third, when you want or need to re-index the data, re-index them into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchaing support you here.

When you're done, point the alias to the new index. In this way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can just switch back to the old one, until you resolve the issue.

Wrote up a blog post about how I handled reindexing with no downtime recently. Takes some time to figure out all the little things that need to be in place to do so. Hope this helps!

https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html

To summarize:

Step 1: Prepare New Index

Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.

Step 2: Keep Indexes Up To Date

While you're reindexing you want to keep both your new and old indexes up to date. For a write operation, this can be done by sending the write operation to a background worker on both the new and old index.

Deletes are a bit trickier because there is a race condition between deleting and reindexing the record into the new index. So, you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way would be to eliminate the possibility of a delete during your reindex.

Step 3: Perform Reindexing

You’ll want to use a scrolled search for reading the data and bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.

This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.

There's a great script on github to help you with this process: es-reindex.

Step 4: Switch Over

Once you’re finished reindexing, it’s time to switch your search over to the new index. You’ll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.

Perform any code changes you need so your application starts searching the new index. You can continue writing to the old index incase you run into problems and need to rollback. If you feel this is unnecessary, you can stop writing to it.

Step 5: Clean Up

At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:

Delete the old index host if it’s different from the new
Remove serialization code related to your old index