MongoDB Socket Exceptions when moving chunks

I have 2 shards in my MongoDB cluster, with 1 mongos server. I have a total of 8 servers, with one replica set having 5 members and the other 3. I have a single collection that is sharded across the cluster, but recently when I attempt a chunk move I receive socket exceptions.
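
For reference, the move is issued through the mongos against the admin database; a manual move that reproduces the failure looks roughly like the sketch below (the namespace, _id value, and target shard name are taken from the logs further down; <mongos-host> is just a placeholder):

    # Rough sketch of the chunk move that fails; run against the mongos.
    # Namespace, _id, and target shard come from the logs below;
    # replace <mongos-host> with the real mongos address.
    mongo <mongos-host>/admin --eval '
      printjson(db.runCommand({
        moveChunk: "social_advantage_analytics.edges",
        find: { _id: "100000007993210_116269473289" },
        to: "percy"
      }))
    '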

All of the servers are running on EC2, with the majority in the same availability zone. The sending server and the receiving server are in different availability zones.

Here are some excerpts from the logs:

Sending Server:

Fri May 20 07:53:28 [conn6158] moveChunk data transfer progress: { active: false, ns: "social_advantage_analytics.edges", from: "slytherin/draco:27018", min: { _id: "100000007993210_116269473289" }, max: { _id: "100000012316922_167580256615048" }, state: "fail", errmsg: "socket exception", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0

Fri May 20 07:53:28 [conn6158] warning: moveChunk error transfering data caused migration abort: { active: false, ns: "social_advantage_analytics.edges", from: "slytherin/draco:27018", min: { _id: "100000007993210_116269473289" }, max: { _id: "100000012316922_167580256615048" }, state: "fail", errmsg: "socket exception", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 }

Receiving Server:

Fri May 20 14:51:10 [migrateThread] about to log metadata event: { _id: "george-2011-05-20T14:51:10-293", server: "george", clientAddr: "(NONE)", time: new Date(1305903070637), what: "moveChunk.to", ns: "social_advantage_analytics.edges", details: { min: { _id: "100000007993210_116269473289" }, max: { _id: "100000012316922_167580256615048" }, note: "aborted" } }

Fri May 20 14:51:10 [migrateThread] ERROR: migrate failed: socket exception

Shard Server:

Fri May 20 07:53:05 [Balancer] balacer move failed: { cause: { active: false, ns: "social_advantage_analytics.edges", from: "slytherin/draco:27018", min: { _id: "100000007993210_116269473289" }, max: { _id: "100000012316922_167580256615048" }, state: "fail", errmsg: "socket exception", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 }, errmsg: "data transfer error", ok: 0.0 } from: pansy to: percy chunk: { _id: "social_advantage_analytics.edges-id"100000007993210_116269473289"", lastmod: Timestamp 90000|354, ns: "social_advantage_analytics.edges", min: { _id: "100000007993210_116269473289" }, max: { _id: "100000012316922_167580256615048" }, shard: "pansy" }


Solution 1:

This is an older question, but some important things to provide are:

  • What version of MongoDB are you running?
  • Can each server in the cluster communicate with every other server in the cluster on the assigned port (27018)? This includes the database shards and the mongos (balancer). A quick connectivity check is sketched after this list.

  • The shards have to talk to each other directly to migrate the chunk data, and since the sending and receiving nodes are in different zones, are they in the same security group? Are there local firewalls preventing communication?

  • Consider inspecting the number of open file descriptors for the mongod process on each server. There is a hard max of 20k enforced in the server code, but typically there is another limit in place. Here are some commands to help (a sketch for raising the limit follows the commands below):

    lsof -p <pid of mongod> | wc -l # count of file descriptors currently open by mongod
    
    su - mongod # or whatever user mongod is running as
    ulimit -n # => some systems default to 1024
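
If that count is anywhere near the ulimit, the limit can be raised for the user mongod runs as. A minimal sketch, assuming mongod runs as a user named mongod on a Linux box using PAM limits (user name and paths may differ on your system):

    # check the limit the running mongod process actually has
    grep 'open files' /proc/<pid of mongod>/limits

    # raise the per-user cap in /etc/security/limits.conf, then restart mongod:
    #   mongod soft nofile 20000
    #   mongod hard nofile 20000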
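
For the connectivity question above, a quick check is to test the socket from each box in both directions; the hostnames below are only examples lifted from the logs (draco on the sending side, george on the receiving side), so substitute your own:

    # from the sending server (draco), test the receiving server (george)
    nc -zv george 27018          # or: telnet george 27018

    # confirm mongod actually answers on that port, not just that it is open
    mongo george:27018/admin --eval "printjson(db.runCommand({ ping: 1 }))"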