HTCondor high availability

I am currently trying to make the job queue and submission mechanism of a local, isolated HTCondor cluster highly available. The cluster consists of two master servers (previously one), several compute nodes, and a central storage system. DNS, LDAP, and other services are provided by the master servers. The HTCondor version is 8.6.8, running on Ubuntu 20.04.1 on all machines.

I followed the directions at https://htcondor.readthedocs.io/en/latest/admin-manual/high-availability.html ; the resulting config is shown below.

The spool directory (/clients/condor/spool) is located on an NFS v3 share (/clients) that every server has access to. All machines have a local user (r-admin) with uid and gid 1000, and the spool directory is owned by that user, since it is configured as the Condor user. All other users are mapped via LDAP on every server, including the storage cluster. On both master servers the user "condor" has the same uid and gid.

The HADLog updates regularly and doesn't report any errors. Only one master holds the primary role at a time, and the ReplicationLog looks fine, too.

However, there are several problems:

Let's assume master1 is currently the primary. Running condor_q without any arguments works only on this machine and shows the correct job queue. On master2, running condor_q leads to a segmentation fault. If the SCHEDD_NAME is given as an argument ("condor_q master@"), there is output, but it contains the IP of master2 and lists no jobs. Also, jobs don't start; they remain in the Idle state.

Does anybody have an idea what could be wrong with the config, or where I could find more insight on this topic? Any help would be appreciated!


Edit

Below you can find the SchedLog entry on master1 when trying to run condor_q on master2:

10/08/20 11:50:30 (pid:47347) Number of Active Workers 0  
10/08/20 11:50:41 (pid:47347) AUTHENTICATE: handshake failed!    
10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: authentication of <192.168.1.22:10977>
did not result in a valid mapped user name, which is 
required for this command (519 QUERY_JOB_ADS_WITH_AUTH), so aborting. 
10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: reason for authentication failure: 
AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|
AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXGNYmKn)  

Master Daemons

DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR

Node Daemons

DAEMON_LIST = MASTER, STARTD

Local Config (/etc/condor/condor_config.local, all servers)

COLLECTOR_NAME = HPC
CENTRAL_MANAGER_HOST = master1.condor,master2.condor
UID_DOMAIN = condor
FILESYSTEM_DOMAIN = condor

ENABLE_HISTORY_ROTATION = TRUE
MAX_HISTORY_LOG = 2000000000
MAX_HISTORY_ROTATIONS = 100

EMAIL_DOMAIN = condor

ENABLE_IPV6 = FALSE

CONDOR_IDS = 1000.1000

QUEUE_SUPER_USERS = root, r-admin

CONDOR_ADMIN = root@condor

SOFT_UID_DOMAIN = TRUE

ALLOW_READ = *, $(CONDOR_HOST), $(IP_ADDRESS), $(CENTRAL_MANAGER_HOST)
ALLOW_WRITE = *, $(CONDOR_HOST), $(IP_ADDRESS), $(CENTRAL_MANAGER_HOST)
ALLOW_ADMINISTRATOR = *, $(CONDOR_HOST), $(IP_ADDRESS), $(CENTRAL_MANAGER_HOST)
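As a side note, the leading * in these lists already matches every host, so the additional macros are redundant and the pool is effectively open to everyone. A tighter variant (just a sketch, assuming all cluster machines resolve under the .condor domain) could look like this:

```
## Sketch: restrict by host instead of allowing '*'
## (assumes every cluster machine resolves under *.condor)
ALLOW_READ = *.condor
ALLOW_WRITE = *.condor, $(IP_ADDRESS)
ALLOW_ADMINISTRATOR = $(CENTRAL_MANAGER_HOST), $(IP_ADDRESS)
```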

HA Config (/etc/condor/config.d/ha.conf, only master servers)

## HA configuration

## Shared Job Queue
MASTER_HA_LIST = SCHEDD
SPOOL = /clients/condor/spool
HA_LOCK_URL = file:/clients/condor/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
SCHEDD_NAME = master@


## Shared Negotiator and Collector
HAD_USE_SHARED_PORT = TRUE
HAD_LIST = master1.condor:$(SHARED_PORT_PORT),master2.condor:$(SHARED_PORT_PORT)

REPLICATION_USE_SHARED_PORT = TRUE
REPLICATION_LIST = master1.condor:$(SHARED_PORT_PORT),master2.condor:$(SHARED_PORT_PORT)

HAD_USE_PRIMARY = TRUE

HAD_CONTROLLEE = NEGOTIATOR
MASTER_NEGOTIATOR_CONTROLLER = HAD

DAEMON_LIST = $(DAEMON_LIST), HAD, REPLICATION

HAD_USE_REPLICATION = TRUE

STATE_FILE = $(SPOOL)/Accountantnew.log

MASTER_HAD_BACKOFF_CONSTANT = 360

With the help of the mailing list and some experimenting I managed to resolve the issue.

Since there are two master servers now, they need not only a shared filesystem for spooling (as described in the HA guide and shown above), but also one for authentication. Configuring FS_REMOTE as an additional authentication method is one way of doing that:

SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /clients/condor/sec
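The directory has to exist on the shared mount before the daemons are restarted. As far as I understand FS_REMOTE, it authenticates by creating and checking scratch files there, so (like /tmp) the directory has to be writable by every user who authenticates; the mode below is my assumption, not something from the HA guide:

```shell
# Create the shared FS_REMOTE scratch directory on the NFS export.
# Mode 1777 (world-writable with the sticky bit, like /tmp) so every
# authenticating user can create its scratch files there.
mkdir -p /clients/condor/sec
chmod 1777 /clients/condor/sec
```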

After the daemons authenticated correctly, jobs started and everything seemed to be fine: condor_q produced the correct output and failover worked as expected. However, jobs were not removed from the job queue after they finished; instead, they were requeued:

SchedLog: "SetEffectiveOwner security violation: setting owner to r-admin when active owner is "condor""
ShadowLog: "SetEffectiveOwner(r-admin) failed with errno=13: Permission denied."

The error messages refer to the user condor, which should not be involved at all, since CONDOR_IDS has been set to 1000.1000 (r-admin). There were no files, processes, or anything else owned by or referencing the user "condor".

It turns out that HTCondor still references this username internally (see the log excerpt below), which seems to be new behavior after the upgrade. After adding "condor" to QUEUE_SUPER_USERS, the problem was resolved and jobs exited normally.
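Putting it all together, these are the additions (on top of the config shown above) that made the HA setup work on both masters:

```
## Shared authentication for both masters
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE
FS_REMOTE_DIR = /clients/condor/sec

## 'condor' is referenced internally even with CONDOR_IDS = 1000.1000
QUEUE_SUPER_USERS = root, r-admin, condor
```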

04/22/21 23:50:25 (fd:19) (pid:2200) (D_COMMAND) Calling HandleReq <handle_q> (0) for command 1112 (QMGMT_WRITE_CMD) from condor@child <XXX.XXX.XXX.XXX:22171>
04/22/21 23:50:25 (fd:19) (pid:2200) (D_SYSCALLS) Got request #10030
04/22/21 23:50:25 (fd:19) (pid:2200) (D_ALWAYS) SetEffectiveOwner security violation: setting owner to r-admin when active owner is "condor"