Elasticsearch Node Failure

My Elasticsearch cluster dropped from 2B documents to 900M records. On AWS it shows:

Relocating shards: 4

while also showing

Active shards: 35

and

Active primary shards: 34

(Might not be relevant, but here's the rest of the stats:)

Number of nodes: 9

Number of data nodes: 6

Unassigned shards: 17

When running

GET /_cluster/allocation/explain

it returns:

{
  "index": "datauwu",
  "shard": 6,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2019-10-31T17:02:11.258Z",
    "details": "node_left[removedforsecuritybecimparanoid1]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "removedforsecuritybecimparanoid2",
      "node_name": "removedforsecuritybecimparanoid2",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid3",
      "node_name": "removedforsecuritybecimparanoid3",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid4",
      "node_name": "removedforsecuritybecimparanoid4",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid5",
      "node_name": "removedforsecuritybecimparanoid5",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid6",
      "node_name": "removedforsecuritybecimparanoid6",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid7",
      "node_name": "removedforsecuritybecimparanoid7",
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}

I'm a bit confused about what this exactly means. Does it mean my Elasticsearch cluster did not lose data and is instead relocating it onto different shards, or can it not find the shards at all?

If it cannot find the shards, does this mean my data was lost? If so, what could be the reason, and how can I prevent this from happening in the future?

I haven't set up replicas because I was indexing data, and replicas slow down indexing.

Also, as a side note, my record count dropped to 400M at one point but then rose back up to 900M. I don't know what this means, and any insight would be greatly appreciated.


"reason": "NODE_LEFT"

And:

I haven't set up replicas because I was indexing data, and replicas slow down indexing.

If the node holding the primary shards has gone away, then yes, your data is gone. After all, if there are no replicas, then where would the cluster retrieve the data from, if the primary (and only) shards are no longer part of the cluster? You will either need to bring the node holding those shards back up and add it into the cluster, or the data is gone.
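
To see which shard copies are actually missing and why, you can list the shards for the index (assuming the cluster is reachable locally on port 9200, as in the example further down):

curl '127.0.0.1:9200/_cat/shards/datauwu?v&h=index,shard,prirep,state,unassigned.reason,node'

Any row with state UNASSIGNED and prirep p is a primary with no surviving copy anywhere in the cluster.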

The error message is saying "You want me to allocate a primary shard for this index that I know exists, but there used to be another version of that primary shard that can't be found anymore, I won't allocate it again in case the previous primary comes back."

You can force Elasticsearch to reallocate the primary shard (and explicitly accept that the data in the previous primary shard is gone) by performing a reroute with the allocate_stale_primary command (see the cluster reroute API documentation):

curl -H 'Content-Type: application/json' \
    -XPOST '127.0.0.1:9200/_cluster/reroute?pretty' -d '{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "datauwu",
        "shard" : 6,
        "node" : "target-data-node-id",
        "accept_data_loss" : true
      }
    }
  ]
}'
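
The node value must be the id (or name) of a data node that is currently part of the cluster; if you are unsure which ids are available, something like the following lists them:

curl '127.0.0.1:9200/_cat/nodes?v&h=id,name,node.role'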

Turning off replicas for anything but development with disposable data is usually a bad idea.
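
If replica overhead during bulk indexing is the concern, a common pattern is to drop replicas only for the duration of the bulk load and restore them when it finishes, rather than running without them permanently. A rough sketch, assuming the index is named datauwu (disabling the refresh interval is optional, but it also helps indexing throughput):

# before the bulk load: no replicas, no periodic refreshes
curl -H 'Content-Type: application/json' \
    -XPUT '127.0.0.1:9200/datauwu/_settings?pretty' -d '{
  "index" : { "number_of_replicas" : 0, "refresh_interval" : "-1" }
}'

# after the bulk load: restore a replica and the default refresh interval
curl -H 'Content-Type: application/json' \
    -XPUT '127.0.0.1:9200/datauwu/_settings?pretty' -d '{
  "index" : { "number_of_replicas" : 1, "refresh_interval" : "1s" }
}'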

Also, as a side note, my record count dropped to 400M at one point but then rose back up to 900M. I don't know what this means, and any insight would be greatly appreciated.

This happens when shards aren't visible in the cluster, which can occur if all copies of a shard are being allocated, relocated, or recovered. It corresponds to a RED cluster state. You can mitigate it by ensuring that you have at least 1 replica (though ideally you have enough replicas to survive the loss of N data nodes in the cluster). This lets Elasticsearch keep one copy of a shard serving as the primary while it moves others around.

If you only have the primary and no replicas, then if a primary is being recovered or relocated, the data in that shard will not be visible in the cluster. Once the shard is active again, the documents in it become visible.
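
You can watch this happen with the cluster health API: status is RED while any primary shard is not active and returns to YELLOW or GREEN once every primary is active again, and unassigned_shards counts down as recovery progresses (same local endpoint assumed as above):

curl '127.0.0.1:9200/_cluster/health?pretty'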