Riak "error":"insufficient_vnodes_available"

We have a 4-node Riak installation running on Ubuntu 12.04 LTS (Precise) servers. We installed 1.1.4 on August 1st, 2012 and upgraded to 1.2.0 when it became available.

Server names are:

f1 - 10.10.0.12 - This is the first installed server. We have joined the other ones to this server. This also serves Riak Control.
s2 - 10.10.0.22
s3 - 10.10.0.23
s4 - 10.10.0.24 - This server also serves Riak Control.

This morning we saw an "insufficient nodes available" error in our application's log and restarted all nodes. Three of them became available again; "f1" did not.

UPDATE: While I was writing this message, the 3 live nodes became unavailable as well and Riak needs to be restarted on them.

wolfiem@f01:~$ sudo /etc/init.d/riak start
Riak failed to start within 15 seconds,
see the output of 'riak console' for more information.
If you want to wait longer, set the environment variable
WAIT_FOR_ERLANG to the number of seconds to wait.

I tried to set WAIT_FOR_ERLANG to 60 seconds, but I couldn't get it to take effect.

Adding this line to vm.args didn't work:

-env WAIT_FOR_ERLANG 60

I also tried setting it from the terminal, but that didn't work either:

wolfiem@f01:~$ export WAIT_FOR_ERLANG=60

It still says "Riak failed to start within 15 seconds".

This is the console.log output:

2012-09-11 10:58:02.532 [info] <0.7.0> Application lager started on node '[email protected]'
2012-09-11 10:58:02.560 [warning] <0.148.0>@riak_core_ring_manager:reload_ring:231 No ring file available.
2012-09-11 10:58:02.585 [error] <0.164.0> CRASH REPORT Process <0.164.0> with 0 neighbours exited with reason: eaddrnotavail in gen_server:init_it/6 line 320

This is the error.log output:

2012-09-11 10:58:02.585 [error] <0.164.0> CRASH REPORT Process <0.164.0> with 0 neighbours exited with reason: eaddrnotavail in gen_server:init_it/6 line 320

This is the crash.log output:

2012-09-11 10:58:02 =CRASH REPORT====
  crasher:
    initial call: mochiweb_socket_server:init/1
    pid: <0.164.0>
    registered_name: []
    exception exit: {eaddrnotavail,[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: [riak_core_sup,<0.135.0>]
    messages: []
    links: [<0.136.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 377
    stack_size: 24
    reductions: 403
  neighbours:

You can find the riak console output below:

wolfiem@f01:~$ riak console
Attempting to restart script through sudo -H -u riak
Exec: /usr/lib/riak/erts-5.9.1/bin/erlexec -boot /usr/lib/riak/releases/1.2.0/riak             -embedded -config /etc/riak/app.config             -pa /usr/lib/riak/basho-patches             -args_file /etc/riak/vm.args -- console
Root: /usr/lib/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:64] [kernel-poll:true]


=INFO REPORT==== 11-Sep-2012::10:44:18 ===
    alarm_handler: {set,{system_memory_high_watermark,[]}}
** /usr/lib/riak/lib/observer-1.1/ebin/etop_txt.beam hides /usr/lib/riak/lib/basho-patches/etop_txt.beam
** Found 1 name clashes in code paths 
10:44:19.099 [info] Application lager started on node '[email protected]'
10:44:19.130 [warning] No ring file available.
10:44:19.158 [error] CRASH REPORT Process <0.164.0> with 0 neighbours exited with reason: eaddrnotavail in gen_server:init_it/6 line 320
/usr/lib/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed. 

=INFO REPORT==== 11-Sep-2012::10:44:19 ===
    alarm_handler: {clear,system_memory_high_watermark}
Erlang has closed
                 {"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}

Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})

Solution 1:

According to http://smartcloud.blogspot.hu/2013/01/setting-riak-cluster-in-amazon-ec2-just.html, the "with 0 neighbours exited with reason" error is caused by a Riak instance that is still (at least partially) running and is holding on to a port or some other resource.

For me, it was a leftover epmd instance. I found it with ps ax | grep riak. After killing it, the problem went away.
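
If it is not obvious whether a leftover process is involved, a quick bind test can help narrow it down. The following is only a minimal sketch in Python, not part of Riak: probe_bind is a hypothetical helper, 10.10.0.12 is f1's address from the question, and 8098, 8087 and 8099 are assumed to be the listener ports (they are the usual Riak defaults, but the real values are whatever /etc/riak/app.config says). At the OS level, EADDRNOTAVAIL means the address is not assigned to any local interface, while EADDRINUSE means another process already holds that address and port.

```python
import errno
import socket


def probe_bind(ip, port):
    """Try to bind a TCP socket to (ip, port) and return a short diagnosis."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))
    except OSError as exc:
        if exc.errno == errno.EADDRNOTAVAIL:
            return "eaddrnotavail"  # ip is not assigned to a local interface
        if exc.errno == errno.EADDRINUSE:
            return "eaddrinuse"  # another process already holds ip:port
        # Any other failure: report the symbolic errno name if there is one
        return errno.errorcode.get(exc.errno, str(exc))
    finally:
        sock.close()
    return "ok"


if __name__ == "__main__":
    # Assumed values: f1's address from the question and common Riak ports.
    # Replace them with the ip/port pairs from your own app.config.
    for port in (8098, 8087, 8099):
        print(port, probe_bind("10.10.0.12", port))
```

Run it on the node that refuses to start, while Riak is stopped. If a port reports eaddrinuse, something is still holding it, which matches the leftover-process case above. If it reports eaddrnotavail, it is worth comparing the IP in app.config with the output of ifconfig on that machine.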