mount.ocfs2: Transport endpoint is not connected while mounting...?

Oh yeah! Problem solved.

Pay attention to the UUID:

# mounted.ocfs2 -d
Device                FS     Stack  UUID                              Label
/dev/sdb              ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1            ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1

but the heartbeat region that O2CB has registered in configfs has a different UUID:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root    0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2

This probably happened because I "accidentally" force re-formatted the OCFS2 volume, which gave it a new UUID while the heartbeat region for the old one was still registered. The problem I was facing is similar to this one on the ocfs2-users mailing list.

This is also the reason for the error below:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

It appears because ocfs2_hb_ctl cannot find a device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 among the devices listed in /proc/partitions.
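Before reaching for tunefs.ocfs2, the mismatch can be confirmed with a short check. This is only a sketch: it assumes the device and cluster name (here /dev/drbd1 and cpc) are passed as arguments, that each directory under /sys/kernel/config/cluster/<cluster>/heartbeat/ is named after the UUID being heartbeated (as in the listing above), and that `tunefs.ocfs2 -Q "%U\n"` prints the volume UUID. The helper names check_hb_uuid and uuid_in_regions are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare an OCFS2 volume's UUID with the heartbeat regions that
# O2CB has registered in configfs. Function names are hypothetical helpers.

# uuid_in_regions UUID [REGION...]
# Succeeds if UUID (compared case-insensitively) equals one of the region names.
uuid_in_regions() {
    want=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    shift
    for region in "$@"; do
        have=$(printf '%s' "$region" | tr '[:lower:]' '[:upper:]')
        [ "$have" = "$want" ] && return 0
    done
    return 1
}

# check_hb_uuid DEVICE CLUSTER
check_hb_uuid() {
    dev=$1
    cluster=$2
    hb_dir=/sys/kernel/config/cluster/$cluster/heartbeat
    uuid=$(tunefs.ocfs2 -Q "%U\n" "$dev") || return 2
    # Directory names under heartbeat/ are the UUIDs O2CB is heartbeating on.
    regions=$(ls "$hb_dir" 2>/dev/null)
    # $regions is deliberately unquoted so each name becomes one argument.
    if uuid_in_regions "$uuid" $regions; then
        echo "OK: $dev ($uuid) has a heartbeat region in cluster $cluster"
    else
        echo "MISMATCH: $dev is $uuid, heartbeat regions: ${regions:-none}"
        return 1
    fi
}

# Run only when called with a device and a cluster name.
if [ "$#" -eq 2 ]; then
    check_hb_uuid "$1" "$2"
fi
```

For example, `sh check_hb_uuid.sh /dev/drbd1 cpc` (the file name is arbitrary) should report MISMATCH in the situation described here, since the disk says 12963EAF... while the only region is 72EF09EA....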

One idea came to mind: can I change the UUID of an OCFS2 volume?

Looking through the tunefs.ocfs2 man page:

Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]

so I ran the following command to give /dev/drbd1 the UUID that the heartbeat region expects:

# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. 
Having two OCFS2 file systems with the same UUID could, in the least, 
cause erratic behavior, and if unlucky, cause file system damage. 
Please choose the UUID with care.
Update the UUID ?yes

Verify:

# tunefs.ocfs2 -Q "%U\n" /dev/drbd1 
72EF09EA3D0D4F51BDC00B47432B1EB2
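The reset and the verification can be combined so the new UUID is read back immediately. A sketch, assuming the same `--uuid-reset=` and `-Q "%U\n"` usage shown above; reset_and_verify_uuid is a hypothetical helper, and the TUNEFS variable exists only so the tool can be swapped for a stub. tunefs.ocfs2 still prints its warning and asks for confirmation, so run it from a terminal, and heed that warning: the UUID must not belong to another OCFS2 volume.

```shell
#!/bin/sh
# Sketch: give an OCFS2 volume a specific UUID, then read it back to confirm.
# TUNEFS defaults to tunefs.ocfs2; it is only an override hook for testing.

reset_and_verify_uuid() {
    dev=$1
    # Use upper case, matching how tunefs.ocfs2 and configfs display UUIDs.
    want=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    tunefs=${TUNEFS:-tunefs.ocfs2}
    # Interactive: tunefs.ocfs2 prints a warning and asks "Update the UUID ?".
    "$tunefs" --uuid-reset="$want" "$dev" || return 1
    got=$("$tunefs" -Q "%U\n" "$dev") || return 1
    got=$(printf '%s' "$got" | tr '[:lower:]' '[:upper:]')
    if [ "$got" = "$want" ]; then
        echo "UUID of $dev is now $got"
    else
        echo "UUID of $dev is $got, expected $want" >&2
        return 1
    fi
}

# Run only when called with a device and the desired UUID.
if [ "$#" -eq 2 ]; then
    reset_and_verify_uuid "$1" "$2"
fi
```

Usage would be something like `sh reset_uuid.sh /dev/drbd1 72EF09EA3D0D4F51BDC00B47432B1EB2`.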

Then I tried to kill the heartbeat region again to see what would happen:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
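Running -K by hand until the count reaches zero is tedious, so a loop can do it. This sketch parses the "<UUID>: N refs" line printed by `ocfs2_hb_ctl -I` as shown above, stops at 0 refs, and gives up after a fixed number of attempts. kill_hb_region and refs_from_info are hypothetical names, and HB_CTL is only an override hook for testing. Presumably this should only be done when nothing on the node still uses the region (the volume is not mounted there).

```shell
#!/bin/sh
# Sketch: drop references on an O2CB heartbeat region until none remain.
# HB_CTL defaults to ocfs2_hb_ctl; it is only an override hook for testing.

# refs_from_info LINE -> prints N from output like "<UUID>: 6 refs"
refs_from_info() {
    printf '%s\n' "$1" | sed -n 's/.*: *\([0-9][0-9]*\) refs.*/\1/p'
}

# kill_hb_region UUID [MAX_KILLS]
kill_hb_region() {
    uuid=$1
    max=${2:-20}   # safety cap so a stuck region cannot loop forever
    hb=${HB_CTL:-ocfs2_hb_ctl}
    i=0
    while [ "$i" -lt "$max" ]; do
        refs=$(refs_from_info "$("$hb" -I -u "$uuid")")
        if [ -z "$refs" ]; then
            echo "could not read ref count for $uuid" >&2
            return 2
        fi
        echo "$uuid: $refs refs"
        [ "$refs" -eq 0 ] && return 0
        # Each -K drops one reference on the region.
        "$hb" -K -u "$uuid" || return 1
        i=$((i + 1))
    done
    echo "still referenced after $max kills" >&2
    return 1
}

# Run only when called with a UUID (and optionally a kill limit).
if [ "$#" -ge 1 ]; then
    kill_hb_region "$@"
fi
```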

I kept killing it until the count dropped to 0 refs, then took the cluster offline:

# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK

and then stopped O2CB:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

Then I started it again to check whether the new node had been picked up:

# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain
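The offline, stop and start cycle plus the node check can be wrapped in one function. A sketch, assuming the init script lives at /etc/init.d/o2cb as on this system and that the heartbeat references have already been dropped (otherwise taking the cluster offline may fail). restart_o2cb is a hypothetical name, and the O2CB and CONFIGFS variables exist only so the paths can be overridden.

```shell
#!/bin/sh
# Sketch: cycle O2CB for one cluster and list the nodes it has registered.
# O2CB and CONFIGFS default to the paths used on this system.

restart_o2cb() {
    cluster=$1
    o2cb=${O2CB:-/etc/init.d/o2cb}
    configfs=${CONFIGFS:-/sys/kernel/config}
    # Cluster settings (IP address, port, ...) only change while it is offline.
    "$o2cb" offline "$cluster" || return 1
    "$o2cb" stop || return 1
    "$o2cb" start || return 1
    # Each directory here is a node the restarted cluster knows about.
    ls -l "$configfs/cluster/$cluster/node/"
}

# Run only when called with a cluster name, e.g. cpc.
if [ "$#" -eq 1 ]; then
    restart_o2cb "$1"
fi
```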

OK, on the peer node (192.168.2.93), I then started OCFS2:

# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)                [  OK  ]

Thanks to Sunil Mushran; this thread helped me solve the problem.

The lessons are:

  1. The IP address, port, ... can only be changed while the cluster is offline. See the FAQ.
  2. Never force a re-format of an OCFS2 volume.