Linux workstations only authenticate on one domain controller

I have 3 domain controllers

192.168.1.6    Server 2016   [PDC equivalent]  Site 1
192.168.1.7    Server 2016                     Site 1
192.168.31.10  Server 2016                     Site 2

and a mix of Windows10, Mac and Debian 10 linux workstations. The windows and macs never have an issue. Recently, (last Wednesday) seemingly at the same time I installed the October windows updates on the domain controllers, users have been intermittently unable to login to the linux workstations.

Logging in from the desktop gives "Incorrect password". Logging in via ssh just closes the connection after a few seconds (as opposed to an actual incorrect password which will say Permission denied, please try again:

username@PC:~$ ssh [email protected]
[email protected]'s password:
Connection closed by 192.168.1.195 port 22

It turned out that forcing the workstations (via /etc/hosts) to resolve our domain (example.com) to the primary DC (192.168.1.6 example.com) fixed it.

Obviously this is just a workaround. I've been unable to find any failed auth events in the windows eventlogs on any DC. It's as if the auth attempt doesn't even reach the DC. I also can't find any results online.

Auth logs on a workstation (192.168.1.125):

Oct 29 10:09:45 exampleLXWS5 sshd[15065]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.1.125  user=john.smith
Oct 29 10:09:45 exampleLXWS5 sshd[15065]: pam_sss(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.1.125 user=john.smith
Oct 29 10:09:45 exampleLXWS5 sshd[15065]: pam_sss(sshd:auth): received for user john.smith: 17 (Failure setting user credentials)
Oct 29 10:09:47 exampleLXWS5 sshd[15065]: Failed password for john.smith from 192.168.1.125 port 54758 ssh2
Oct 29 10:09:52 exampleLXWS5 sshd[15065]: pam_sss(sshd:auth): authentication success; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.1.125 user=john.smith
Oct 29 10:09:57 exampleLXWS5 sshd[15065]: pam_sss(sshd:account): Access denied for user john.smith: 4 (System error)
Oct 29 10:09:57 exampleLXWS5 sshd[15065]: Failed password for john.smith from 192.168.1.125 port 54758 ssh2
Oct 29 10:09:57 exampleLXWS5 sshd[15065]: fatal: Access denied for user john.smith by PAM account configuration [preauth]
Oct 29 10:09:57 exampleLXWS5 sshd[15065]: PAM 1 more authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.1.125  user=john.smith

Here are some more logs with debug_level = 5 in sssd.conf for a failed login attempt.

==> /var/log/sssd/sssd_EXAMPLE.COM.log <==
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [dp_get_account_info_handler] (0x0200): Got request for [0x3][BE_REQ_INITGROUPS][[email protected]]
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [sysdb_set_entry_attr] (0x0200): Entry [[email protected],cn=users,cn=EXAMPLE.COM,cn=sysdb] has set [ts_cache] attrs.
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [sysdb_set_entry_attr] (0x0200): Entry [[email protected],cn=users,cn=EXAMPLE.COM,cn=sysdb] has set [ts_cache] attrs.
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [dp_pam_handler] (0x0100): Got request with the following data
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): command: SSS_PAM_AUTHENTICATE
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): domain: EXAMPLE.COM
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): user: [email protected]
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): service: sshd
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): tty: ssh
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): ruser:
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): rhost: 192.168.1.116
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): authtok type: 1
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): newauthtok type: 0
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): priv: 1
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): cli_pid: 15728
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): logon name: not set
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [krb5_auth_send] (0x0100): Home directory for user [[email protected]] not known.
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'AD'
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [resolve_srv_send] (0x0200): The status of SRV lookup is resolved
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [be_resolve_server_process] (0x0200): Found address for server domaincontroller1.EXAMPLE.COM: [192.168.1.6] TTL 3600

==> /var/log/sssd/krb5_child.log <==
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [unpack_buffer] (0x0100): cmd [241] uid [1860816111] gid [1860800513] validate [true] enterprise principal [true] offline [false] UPN [[email protected]]
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [unpack_buffer] (0x0100): ccname: [FILE:/tmp/krb5cc_1860816111_XXXXXX] old_ccname: [FILE:/tmp/krb5cc_1860816111_SzxXi9] keytab: [/etc/krb5.keytab]
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [check_use_fast] (0x0100): Not using FAST.
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [switch_creds] (0x0200): Switch user to [1860816111][1860800513].
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [switch_creds] (0x0200): Switch user to [0][0].
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [become_user] (0x0200): Trying to become user [1860816111][1860800513].
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [try_open_krb5_conf] (0x0080): Cannot open /etc/krb5.conf [2]: No such file or directory
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [set_lifetime_options] (0x0100): No specific lifetime requested.
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [john.smith\@[email protected]] might not be correct.
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [switch_creds] (0x0200): Switch user to [1860816111][1860800513].
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [switch_creds] (0x0200): Already user [1860816111].
(Fri Oct 29 13:56:02 2021) [[sssd[krb5_child[15730]]]] [k5c_send_data] (0x0200): Received error code 0

==> /var/log/sssd/sssd_EXAMPLE.COM.log <==
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [child_sig_handler] (0x0100): child [15730] finished successfully.
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [fo_set_port_status] (0x0100): Marking port 389 of server 'domaincontroller1.EXAMPLE.COM' as 'working'
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [set_server_common_status] (0x0100): Marking server 'domaincontroller1.EXAMPLE.COM' as 'working'
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [sysdb_set_entry_attr] (0x0200): Entry [[email protected],cn=users,cn=EXAMPLE.COM,cn=sysdb] has set [cache, ts_cache] attrs.
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [sysdb_set_entry_attr] (0x0200): Entry [[email protected],cn=users,cn=EXAMPLE.COM,cn=sysdb] has set [cache, ts_cache] attrs.
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [dp_pam_handler] (0x0100): Got request with the following data
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): command: SSS_PAM_ACCT_MGMT
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): domain: EXAMPLE.COM
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): user: [email protected]
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): service: sshd
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): tty: ssh
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): ruser:
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): rhost: 192.168.1.116
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): authtok type: 0
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): newauthtok type: 0
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): priv: 1
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): cli_pid: 15728
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [pam_print_data] (0x0100): logon name: not set
(Fri Oct 29 13:56:02 2021) [sssd[be[EXAMPLE.COM]]] [ad_gpo_get_som_attrs_done] (0x0040): no attrs found for SOM; try next SOM

==> /var/log/sssd/gpo_child.log <==
(Fri Oct 29 13:56:08 2021) [[sssd[gpo_child[15731]]]] [copy_smb_file_to_gpo_cache] (0x0020): smbc_getFunctionOpen failed [110][Connection timed out]
(Fri Oct 29 13:56:08 2021) [[sssd[gpo_child[15731]]]] [perform_smb_operations] (0x0020): copy_smb_file_to_gpo_cache failed [110][Connection timed out]
(Fri Oct 29 13:56:08 2021) [[sssd[gpo_child[15731]]]] [main] (0x0020): perform_smb_operations failed.[110][Connection timed out].
(Fri Oct 29 13:56:08 2021) [[sssd[gpo_child[15731]]]] [main] (0x0020): gpo_child failed!

==> /var/log/sssd/sssd_EXAMPLE.COM.log <==
(Fri Oct 29 13:56:08 2021) [sssd[be[EXAMPLE.COM]]] [gpo_cse_done] (0x0020): ad_gpo_parse_gpo_child_response failed: [22][Invalid argument]
(Fri Oct 29 13:56:08 2021) [sssd[be[EXAMPLE.COM]]] [ad_gpo_cse_done] (0x0040): Unable to retrieve policy data: [22](Invalid argument}
(Fri Oct 29 13:56:08 2021) [sssd[be[EXAMPLE.COM]]] [ad_gpo_access_done] (0x0040): GPO-based access control failed.
(Fri Oct 29 13:56:08 2021) [sssd[be[EXAMPLE.COM]]] [child_sig_handler] (0x0020): child [15731] failed with status [1].

Solution 1:

After a lot more digging through logs (debug level 7 on sssd) and googling, I still don't know what the root cause is, but I found it was throwing an error on accessing a particular group policy object. I checked the object permissions in \\domain.com\SYSVOL\.... and it all looked fine.

This setting makes the problem go away in a slightly more palatable way than pinning the domain to a single DC in /etc/hosts:

in /etc/sssd/sssd.conf under the [domain/MYDOMAIN.COM] section, I added:

ad_gpo_access_control = permissive

Full file contents:

# cat /etc/sssd/sssd.conf

[sssd]
#debug_level = 7
domains = MYDOMAIN.COM
config_file_version = 2
services = nss, pam

[domain/MYDOMAIN.COM]
#debug_level = 7
# Added ad_gpo_access_control = permissive on 2021-11-01 to fix AD login issue
ad_gpo_access_control = permissive
ad_domain = MYDOMAIN.COM
krb5_realm = MYDOMAIN.COM
realmd_tags = manages-system joined-with-adcli
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = False
fallback_homedir = /home/%u@%d
access_provider = ad