RabbitMQ randomly crashes
I've recently deployed a RabbitMQ server on AWS following the instructions for RPM-based Linux distros.
RabbitMQ 3.8.14
Erlang 23.3.1
It's a single-node deployment with only one vhost ("/") I've been using mostly the default configuration. I've also modified the open file limit initially to 65536 and currently to 150000. Although the used file descriptors remain at a relatively low level.
I have two apps that connect to RabitMQ. Each app is authenticated with its own user and password. Both users have full rights on the vhost and the topics they consume.
I've setup 7 exchanges and queues. All durable.
The server starts without any issue and the apps are working fine and are able to communicate with the server with no issue and consume from the topics but they randomly crash.
On the rabbitmq log file I can see the following
2021-04-21 02:49:27.342 [info] <0.32135.4> connection <0.32135.4> (10.11.234.236:39453 -> 172.24.22.82:5672): user 'mes' authenticated and granted access to vhost '/'
2021-04-21 02:49:27.374 [info] <0.32138.4> connection <0.32138.4> (10.11.252.54:23576 -> 172.24.22.82:5672): user 'mes' authenticated and granted access to vhost '/'
2021-04-21 03:02:37.689 [error] <0.31757.4> closing AMQP connection <0.31757.4> (10.11.161.117:45741 -> 172.24.22.82:5672):
{writer,send_failed,{error,timeout}}
2021-04-21 03:02:37.690 [info] <0.32596.4> Closing all channels from connection '10.11.161.117:45741 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:02:48.136 [info] <0.32614.4> accepting AMQP connection <0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672)
2021-04-21 03:02:48.142 [info] <0.32614.4> connection <0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:03:18.346 [error] <0.32614.4> closing AMQP connection <0.32614.4> (10.11.161.117:2496 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:03:18.347 [info] <0.32674.4> Closing all channels from connection '10.11.161.117:2496 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:03:30.140 [info] <0.32694.4> accepting AMQP connection <0.32694.4> (10.11.161.117:54985 -> 172.24.22.82:5672)
2021-04-21 03:03:30.144 [info] <0.32694.4> connection <0.32694.4> (10.11.161.117:54985 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:04:00.387 [error] <0.32694.4> closing AMQP connection <0.32694.4> (10.11.161.117:54985 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:04:00.395 [info] <0.32752.4> Closing all channels from connection '10.11.161.117:54985 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:04:14.035 [info] <0.5.5> accepting AMQP connection <0.5.5> (10.11.161.117:63900 -> 172.24.22.82:5672)
2021-04-21 03:04:14.040 [info] <0.5.5> connection <0.5.5> (10.11.161.117:63900 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:04:44.270 [error] <0.5.5> closing AMQP connection <0.5.5> (10.11.161.117:63900 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:04:44.271 [info] <0.56.5> Closing all channels from connection '10.11.161.117:63900 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:04:44.316 [error] <0.14.5> ** Generic server <0.14.5> terminating
** Last message in was {'$gen_cast',terminate}
** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.5.5>,<0.12.5>,<0.5.5>,<<"10.11.161.117:63900 -> 172.24.22.82:5672">>,undefined,{user,<<"cron">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,<<>>,<0.6.5>,[{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"basic.nack">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],none,0,134217728,undefined,#{},1000000000},{lstate,<0.13.5>,false},none,5514,{5438,{[{5513,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265917}},{5512,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265916}},{5511,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265915}},{5510,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265914}},{5509,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265913}},{5508,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265912}},{5507,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265911}},{5506,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265910}},{5505,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,1618988684271,{<0.522.0>,265909}},{5504,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmai...">>,...},...],...}},...}
** Reason for termination ==
** noproc
2021-04-21 03:04:44.317 [info] <0.14.5> [{initial_call,{rabbit_channel,init,['Argument__1']}},{pid,<0.14.5>},{registered_name,[]},{error_info,{exit,noproc,[{gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1183}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,236}]}]}},{ancestors,[<0.11.5>,<0.9.5>,<0.4.5>,<0.3.5>,<0.853.0>,<0.852.0>,<0.851.0>,rabbit_sup,<0.274.0>]},{message_queue_len,38},{messages,[{'$gen_cast',{deliver,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,true,{{resource,<<"/">>,queue,<<"q.etm.mes.tmce_results">>},<0.522.0>,265918,true,{basic_message,{resource,<<"/">>,exchange,<<"x.etm.mes.tmce_results">>},[<<"test">>],{content,60,{'P_basic',<<"application/json">>,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<128,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110>>,rabbit_framing_amqp_0_9_1,[<<"{\"customerId\":5435103,\"campaignRunId\":66836,\"status\":\"Sent\",\"messageKey\":\"66836~5435103\",\"timeSent\":1618988531.764735000}">>]},<<225,81,222,134,43,120,7,4,135,105,190,34,66,200,149,86>>,false}}}},{'EXIT',<0.11.5>,shutdown},{'$gen_cast',{deliver,<<"TmceResultConsumer#void handleTmceResult(Envelope envelope,CampaignEmailResponseResource resource)">>,true,{{resource,<<"/">>,queue,<<"q.etm.mes.tmce_results">>},<0.522.0>,265919,true,{basic_message,{resource,<<"/">>,exchange,<<"x.etm.mes.tmce_results">>},[<<"test">>],{content,60,{'P_basic',<<"application/json">>,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<128,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110>>,rabbit_framing_amqp_0_9_1,[<<"{\"customerId\":8288025,\"campaignRunId\":66836,\"status\":\"Sent\",\"messageKey\":\"66836~8288025\",\"timeSent\":1618988531.764737000}">>]},<<201,223,49,27,101,206,136,61,160,111,163,73,226,167,54,31>>,false}}}},{'$gen_cast',{deliver,<<"TmceResultConsu...">>,...}},...]},...], []
2021-04-21 03:04:44.318 [error] <0.14.5> CRASH REPORT Process <0.14.5> with 0 neighbours exited with reason: no such process or port in call to gen_server2:terminate/3 line 1183
2021-04-21 03:04:44.318 [info] <0.11.5> supervisor: {<0.11.5>,rabbit_channel_sup}, errorContext: shutdown_error, reason: noproc, offender: [{pid,<0.14.5>},{id,channel},{mfargs,{rabbit_channel,start_link,[1,<0.5.5>,<0.12.5>,<0.5.5>,<<"10.11.161.117:63900 -> 172.24.22.82:5672">>,rabbit_framing_amqp_0_9_1,{user,<<"cron">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,[{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"basic.nack">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],<0.6.5>,<0.13.5>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]
2021-04-21 03:04:44.318 [error] <0.11.5> Supervisor {<0.11.5>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.5.5>, <0.12.5>, <0.5.5>, <<"10.11.161.117:63900 -> 172.24.22.82:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"cron">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.6.5>, <0.13.5>) at <0.14.5> exit with reason noproc in context shutdown_error
2021-04-21 03:04:50.625 [info] <0.73.5> accepting AMQP connection <0.73.5> (10.11.161.117:42412 -> 172.24.22.82:5672)
2021-04-21 03:04:50.630 [info] <0.73.5> connection <0.73.5> (10.11.161.117:42412 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:05:20.862 [error] <0.73.5> closing AMQP connection <0.73.5> (10.11.161.117:42412 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:05:20.863 [info] <0.131.5> Closing all channels from connection '10.11.161.117:42412 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:05:32.643 [info] <0.149.5> accepting AMQP connection <0.149.5> (10.11.161.117:10578 -> 172.24.22.82:5672)
2021-04-21 03:05:32.647 [info] <0.149.5> connection <0.149.5> (10.11.161.117:10578 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:06:02.926 [error] <0.149.5> closing AMQP connection <0.149.5> (10.11.161.117:10578 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:06:02.927 [info] <0.206.5> Closing all channels from connection '10.11.161.117:10578 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:06:15.925 [info] <0.228.5> accepting AMQP connection <0.228.5> (10.11.161.117:41072 -> 172.24.22.82:5672)
2021-04-21 03:06:15.929 [info] <0.228.5> connection <0.228.5> (10.11.161.117:41072 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:06:48.903 [error] <0.228.5> closing AMQP connection <0.228.5> (10.11.161.117:41072 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:06:48.904 [info] <0.279.5> Closing all channels from connection '10.11.161.117:41072 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:06:57.332 [info] <0.305.5> accepting AMQP connection <0.305.5> (10.11.161.117:14359 -> 172.24.22.82:5672)
2021-04-21 03:06:57.337 [info] <0.305.5> connection <0.305.5> (10.11.161.117:14359 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:07:27.574 [error] <0.305.5> closing AMQP connection <0.305.5> (10.11.161.117:14359 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:07:27.575 [info] <0.357.5> Closing all channels from connection '10.11.161.117:14359 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:07:34.426 [info] <0.375.5> accepting AMQP connection <0.375.5> (10.11.161.117:48428 -> 172.24.22.82:5672)
2021-04-21 03:07:34.432 [info] <0.375.5> connection <0.375.5> (10.11.161.117:48428 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:08:04.683 [error] <0.375.5> closing AMQP connection <0.375.5> (10.11.161.117:48428 -> 172.24.22.82:5672):
{inet_error,enotconn}
2021-04-21 03:08:04.684 [info] <0.432.5> Closing all channels from connection '10.11.161.117:48428 -> 172.24.22.82:5672' because it has been closed
2021-04-21 03:08:13.436 [info] <0.449.5> accepting AMQP connection <0.449.5> (10.11.161.117:23879 -> 172.24.22.82:5672)
2021-04-21 03:08:13.442 [info] <0.449.5> connection <0.449.5> (10.11.161.117:23879 -> 172.24.22.82:5672): user 'cron' authenticated and granted access to vhost '/'
2021-04-21 03:08:43.668 [error] <0.449.5> closing AMQP connection <0.449.5> (10.11.161.117:23879 -> 172.24.22.82:5672):
{inet_error,enotconn}
After this crash, the app is not able to recover and to communicate with RabbitMQ unless I restart the RabbitMQ server.
Additionally when I check the rabbitmq-server.service
I can see that it is active.
The same thing happens when I check with rabbitmq-diagnostics is_running
Asking node rabbit@rabbitmq-test1 for its status ...
RabbitMQ on node rabbit@rabbitmq-test1 is fully booted and running
So far I haven't been able to find a pattern as to when this happens. Any ideas as to why this might be happening?
Solution 1:
It turns out the issue was irrelevant to the rabbitMQ confiugration. It was due to a slowness between the communication of the app and MongoDB (which we were also using) that caused the connection between the app and RabbitMQ to time out.