Corosync-Pacemaker no split brain
I'm trying to set up a two-node cluster on CentOS 7 with Corosync, Pacemaker and pcsd. I can migrate resources manually from one node to the other, but if I turn off the primary node (by unplugging the power cable), the secondary node does not become primary. I have two network interfaces: eno1 (10.211.0.0/24) for the default route and VRRP, and eno2 (10.255.255.0/30) for Corosync and Pacemaker.
Here are configs:
pcs config show
Cluster Name: PBX
Corosync Nodes:
pbx-1no pbx-2no
Pacemaker Nodes:
pbx-1no pbx-2no
Resources:
Master: PBX_DRBD_master
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: PBX_DRBD (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=asterisk_DRBD
Operations: demote interval=0s timeout=90 (PBX_DRBD-demote-interval-0s)
monitor interval=10s on-fail=restart role=Master timeout=20s (PBX_DRBD-monitor-interval-10s)
monitor interval=20s on-fail=restart role=Slave timeout=20s (PBX_DRBD-monitor-interval-20s)
notify interval=0s timeout=90 (PBX_DRBD-notify-interval-0s)
promote interval=0s timeout=90 (PBX_DRBD-promote-interval-0s)
reload interval=0s timeout=30 (PBX_DRBD-reload-interval-0s)
start interval=0s on-fail=restart timeout=240s (PBX_DRBD-start-interval-0s)
stop interval=0s on-fail=block timeout=100s (PBX_DRBD-stop-interval-0s)
Resource: PBX_FS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd0 directory=/mnt/drbd0 fstype=ext4
Operations: monitor interval=20s on-fail=restart timeout=40s (PBX_FS-monitor-interval-20s)
notify interval=0s timeout=60s (PBX_FS-notify-interval-0s)
start interval=0s on-fail=restart timeout=60s (PBX_FS-start-interval-0s)
stop interval=0s on-fail=block timeout=60s (PBX_FS-stop-interval-0s)
Resource: PBX_IP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 iflabel=0 ip=10.211.0.10 nic=eno1
Operations: monitor interval=10s on-fail=restart timeout=20s (PBX_IP-monitor-interval-10s)
start interval=0s on-fail=restart timeout=20s (PBX_IP-start-interval-0s)
stop interval=0s on-fail=block timeout=20s (PBX_IP-stop-interval-0s)
Resource: PBX_ROUTE_default (class=ocf provider=heartbeat type=Route)
Attributes: destination=0.0.0.0/0 family=ip4 gateway=10.211.0.1 source=10.211.0.10
Operations: monitor interval=10s on-fail=restart timeout=20s (PBX_ROUTE_default-monitor-interval-10s)
reload interval=0s timeout=20s (PBX_ROUTE_default-reload-interval-0s)
start interval=0s on-fail=restart timeout=20s (PBX_ROUTE_default-start-interval-0s)
stop interval=0s on-fail=ignore timeout=20s (PBX_ROUTE_default-stop-interval-0s)
Resource: PBX_mariadb (class=systemd type=mariadb.service)
Operations: monitor interval=100s on-fail=ignore timeout=60s (PBX_mariadb-monitor-interval-100s)
start interval=0s on-fail=ignore timeout=100s (PBX_mariadb-start-interval-0s)
stop interval=0s on-fail=ignore timeout=100s (PBX_mariadb-stop-interval-0s)
Resource: PBX_httpd (class=systemd type=httpd.service)
Operations: monitor interval=100s on-fail=ignore timeout=60s (PBX_httpd-monitor-interval-100s)
start interval=0s on-fail=ignore timeout=100s (PBX_httpd-start-interval-0s)
stop interval=0s on-fail=ignore timeout=100s (PBX_httpd-stop-interval-0s)
Resource: PBX_asterisk (class=systemd type=asterisk.service)
Operations: monitor interval=100s on-fail=ignore timeout=60s (PBX_asterisk-monitor-interval-100s)
start interval=0s on-fail=ignore timeout=100s (PBX_asterisk-start-interval-0s)
stop interval=0s on-fail=ignore timeout=100s (PBX_asterisk-stop-interval-0s)
Clone: ping_internal-clone
Resource: ping_internal (class=ocf provider=pacemaker type=ping)
Attributes: dampen=5s host_list="10.255.255.1 10.255.255.2" multiplier=1000
Operations: monitor interval=10 timeout=60 (ping_internal-monitor-interval-10)
start interval=0s timeout=60 (ping_internal-start-interval-0s)
stop interval=0s timeout=20 (ping_internal-stop-interval-0s)
Stonith Devices:
Resource: hpilo1 (class=stonith type=fence_ilo5)
Attributes: ipaddr=ilo1.emergency login=admin passwd=11111 pcmk_host_list=pbx-1no
Operations: monitor interval=60s (hpilo1-monitor-interval-60s)
Resource: hpilo2 (class=stonith type=fence_ilo5)
Attributes: ipaddr=ilo2.emergency login=admin passwd=11111 pcmk_host_list=pbx-2no
Operations: monitor interval=60s (hpilo2-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Resource: PBX_FS
Enabled on: pbx-1no (score:INFINITY) (role: Started) (id:cli-prefer-PBX_FS)
Resource: hpilo1
Disabled on: pbx-1no (score:-INFINITY) (id:location-hpilo1-pbx-1no--INFINITY)
Resource: hpilo2
Disabled on: pbx-2no (score:-INFINITY) (id:location-hpilo2-pbx-2no--INFINITY)
Ordering Constraints:
promote PBX_DRBD_master then start PBX_FS (kind:Mandatory) (id:order-PBX_DRBD_master-PBX_FS-mandatory)
start PBX_FS then start PBX_IP (kind:Mandatory) (id:order-PBX_FS-PBX_IP-mandatory)
start PBX_IP then start PBX_ROUTE_default (kind:Mandatory) (id:order-PBX_IP-PBX_ROUTE_default-mandatory)
start PBX_FS then start PBX_asterisk (kind:Mandatory) (id:order-PBX_FS-PBX_asterisk-mandatory)
start PBX_FS then start PBX_mariadb (kind:Mandatory) (id:order-PBX_FS-PBX_mariadb-mandatory)
start PBX_mariadb then start PBX_httpd (kind:Mandatory) (id:order-PBX_mariadb-PBX_httpd-mandatory)
Colocation Constraints:
PBX_ROUTE_default with PBX_IP (score:INFINITY) (id:colocation-PBX_ROUTE_default-PBX_IP-INFINITY)
PBX_FS with PBX_DRBD_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-PBX_FS-PBX_DRBD_master-INFINITY)
PBX_IP with PBX_FS (score:INFINITY) (id:colocation-PBX_IP-PBX_FS-INFINITY)
PBX_asterisk with PBX_FS (score:INFINITY) (id:colocation-PBX_asterisk-PBX_FS-INFINITY)
PBX_mariadb with PBX_FS (score:INFINITY) (id:colocation-PBX_mariadb-PBX_FS-INFINITY)
PBX_httpd with PBX_FS (score:INFINITY) (id:colocation-PBX_httpd-PBX_FS-INFINITY)
Ticket Constraints:
Alerts:
Alert: smtp_alert (path=/var/lib/pacemaker/alert_smtp.sh)
Recipients:
Recipient: smtp_alert-recipient (value=hidden)
Resources Defaults:
resource-stickiness=100
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: PBX
dc-version: 1.1.23-1.el7_9.1-9acf116022
have-watchdog: false
last-lrm-refresh: 1613632161
no-quorum-policy: ignore
stonith-enabled: true
Quorum:
Options:
Corosync.conf
totem {
version: 2
cluster_name: PBX
secauth: on
transport: udpu
token: 5000
}
nodelist {
node {
ring0_addr: pbx-1no
nodeid: 1
}
node {
ring0_addr: pbx-2no
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
asterisk.DRBD
resource asterisk_DRBD {
handlers {
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
}
disk {
on-io-error detach;
}
net {
protocol C;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri call-pri-lost-after-sb;
cram-hmac-alg "sha1";
shared-secret "something";
}
on pbx-1 {
device /dev/drbd0;
disk /dev/md3;
address 10.255.255.1:7789;
meta-disk internal;
}
on pbx-2 {
device /dev/drbd0;
disk /dev/md3;
address 10.255.255.2:7789;
meta-disk internal;
}
}
At first I suspected routing, because when eno2 is down there is no route for 10.255.255.0/30 and the traffic goes through the default gateway. But I added a rule on the router that drops these packets, and it made no difference. What could be the problem?
Solution 1:
The problem was the IP address. When the main node shuts down, the Ethernet link on the secondary node also goes down and the interface is left with no IP address. So I made a script that runs ifdown/ifup if there is no IP address on the interface.
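The script itself was not posted, so the following is only a minimal sketch of the idea, not the original. It assumes the affected interface is eno2 (the Corosync link from the question) and that the interface is managed by the CentOS 7 network scripts, so that ifdown/ifup apply. The function names has_ipv4 and check_iface and the --run argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: bounce an interface that has lost its IPv4 address.
# Assumption: eno2 is the cluster interconnect, as in the question.
IFACE="${IFACE:-eno2}"

# Succeeds if the given `ip -4 -o addr show` output contains an IPv4 address.
has_ipv4() {
    printf '%s\n' "$1" | grep -q 'inet [0-9]'
}

# Run ifdown/ifup on the interface only when it has no IPv4 address.
check_iface() {
    addr_output=$(ip -4 -o addr show dev "$1" 2>/dev/null)
    if ! has_ipv4 "$addr_output"; then
        ifdown "$1"
        ifup "$1"
    fi
}

# Act only when called with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    check_iface "$IFACE"
fi
```

It could be run periodically, for example from cron, as `/path/to/script --run` (the path is a placeholder). Whether a periodic check reacts quickly enough for failover depends on the interval and on the Corosync token timeout, so treat it as a starting point only.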