Why is Mac OSX Lion losing login/network credentials?

Symptoms

At work we have OSX 10.7.3 installed, and every once in a while I see the following behaviors:

  1. If the screen is locked, repeated attempts with the same username and password are rejected.

  2. If the screen is unlocked, opening a new bash terminal may produce prompts or errors such as:

    `I have no name$`
    

    or

    lkyrala$ ssh lkyrala@ah-lkyrala2u 
    You don't exist, go away!
    

Even when our Macs are working normally, everyone here has to log in twice. The first attempt after boot always fails, but the second (same password, nothing changed, just pressing Enter again) succeeds. Weird?
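Both the `I have no name` prompt and the ssh error are what programs print when the current numeric UID cannot be mapped back to a user record: bash falls back to the placeholder name `I have no name!`, and the ssh client exits. A minimal Python 3 sketch of that same lookup, so it can be run from a fresh terminal the next time the problem appears (`current_user_name` is a hypothetical helper name, not an existing tool):

```python
import os
import pwd


def current_user_name(uid=None):
    """Return the login name for `uid` (default: the current UID), or None.

    None means the passwd lookup failed, the same condition that makes
    bash show "I have no name!" and makes ssh refuse to start.
    """
    if uid is None:
        uid = os.getuid()
    try:
        # pwd.getpwuid asks the C library for the passwd entry;
        # on OS X that request is answered through Open Directory.
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No entry could be found for this UID, e.g. because the
        # directory service that normally answers for it is unavailable.
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = current_user_name(uid)
    if name is None:
        print(f"UID {uid} does not resolve to a user right now")
    else:
        print(f"UID {uid} resolves to {name}")
```

If it reports that the UID does not resolve at the same moment the prompt misbehaves, that suggests the fault is in the user lookup itself rather than in bash or ssh.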

Workarounds

There are some workarounds that resolve the immediate problem, but don't prevent it from happening again:

  1. Wait (maybe an hour or two); the problems sometimes go away on their own.

  2. Kill 'opendirectoryd' and let it restart (from Apple Support Communities: "User ID (not data) deleted suddenly?").

  3. Hold the power button to force a restart.

UPDATE 10/4/2012

Our network admins suspect that lockd is implicated. lockd apparently uses UDP, so when the network is congested, packets are lost and requests hang. They are looking at ways to reduce the congestion. If the file access that hangs happens to involve the Active Directory authentication handle, then all of these different pieces start to fit together.
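To get a feel for why congestion would show up as hangs rather than clean errors, here is a rough Python model of a UDP request/reply exchange with retries. It is a simplification, not a description of how lockd is actually implemented: it assumes each datagram is lost independently with probability `loss`, that an attempt succeeds only if both the request and the reply arrive, and that the client retries a fixed number of times. Each failed attempt typically also costs a retransmission timeout, so extra attempts translate directly into waiting.

```python
def rpc_failure_probability(loss, attempts):
    """Probability that a UDP request/reply exchange never completes.

    Simplified model (an assumption, not lockd's actual behavior): each
    datagram is lost independently with probability `loss`, an attempt
    succeeds only if both the request and the reply arrive, and the
    client gives up after `attempts` tries.
    """
    if not 0.0 <= loss <= 1.0 or attempts < 1:
        raise ValueError("loss must be in [0, 1] and attempts must be >= 1")
    per_attempt_success = (1.0 - loss) ** 2
    return (1.0 - per_attempt_success) ** attempts


def expected_attempts(loss):
    """Mean number of tries until success, with unlimited retries (geometric)."""
    per_attempt_success = (1.0 - loss) ** 2
    if per_attempt_success == 0.0:
        return float("inf")
    return 1.0 / per_attempt_success


for loss in (0.01, 0.05, 0.20):
    print(
        f"loss {loss:.0%}: gives up after 3 tries with p = "
        f"{rpc_failure_probability(loss, 3):.1e}, "
        f"expected tries = {expected_attempts(loss):.2f}"
    )
```

Outright failure stays rare even at a few percent loss, but the expected number of tries (and therefore timeouts) climbs quickly, which fits intermittent stalls better than hard errors.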

Discussion

The evidence above points me to something screwy with Open Directory (opendirectoryd) and login credentials. Other people report similar login problems, but it's hard to determine where the actual problem lies: the Mac, or the network environment?

I should add that most of the machines on the network run Windows, though we also have quite a few Macs and Linux machines. I'm not sure how network authentication is mapped between the various domains. All I know is that the same network credentials work for Windows domain logins as well as Mac and Linux logins, so something is either connecting the separate systems or they all share one global authentication system.

Additional Detail

Unfortunately, I didn't set up this Mac (our IT department did), so I'm not entirely sure how authentication works. I do know that it is a network login, which is unusual in my experience: Macs usually have local accounts that connect to external resources, but here our home folder is on the network rather than local. On my Linux installs, connecting to the network involves yp/NIS (which lets us automount parts of our network filesystem from any machine), and opendirectoryd.log seems to confirm that NIS is involved here too.

/var/log/opendirectoryd.log* shows:

    2012-04-04 01:29:12.370 EDT - ddddd.dddddd.dddddd.dddddd - Client: automount, UID: 0, EUID: 0, GID: 0, EGID: 0
    2012-04-04 01:29:12.370 EDT - ddddd.dddddd.dddddd.dddddd, Node: /NIS/Domain, Module: nis - could not determine map for rectype 'mounts' attribute 'byname'
    2012-04-04 01:32:04.504 EDT - failed to get YP map list

It looks like the domain 'Domain' is being lost somehow. Why is the UID == 0 here? That seems bad, doesn't it?
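To check whether these NIS errors line up with the times a login or unlock fails, it can help to pull only the NIS-related entries out of the log. A rough Python 3 sketch; the line layout (date, time with milliseconds, time zone, ` - `, message) is inferred from the excerpt above, and the keyword list `NIS_KEYWORDS` is a guess that may need extending:

```python
import re
import sys

# Layout inferred from the excerpt: "2012-04-04 01:29:12.370 EDT - <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (?P<tz>\w+) - (?P<msg>.*)$"
)

# Hypothetical keywords that mark NIS/YP trouble; extend as needed.
NIS_KEYWORDS = ("Module: nis", "YP map", "/NIS/")


def nis_failures(lines):
    """Yield (timestamp, message) for log lines whose message mentions NIS/YP."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and any(k in m.group("msg") for k in NIS_KEYWORDS):
            yield m.group("stamp"), m.group("msg")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 nis_failures.py /var/log/opendirectoryd.log
    with open(sys.argv[1], errors="replace") as f:
        for stamp, msg in nis_failures(f):
            print(stamp, msg)
```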

A while back under Linux, I discovered that the NIS broadcast had been disabled or blocked, so I got the server IPs from someone and set the ypserver IPs manually in /etc/yp.conf, which fixed the drops on Linux. Maybe something similar is going on here?
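For reference, the Linux fix amounts to replacing broadcast discovery with explicit server entries. The small Python 3 helper below only renders such entries, using the two forms described in the Linux yp.conf(5) man page (`ypserver HOSTNAME` and `domain NISDOMAIN server HOSTNAME`). The function name and the 192.0.2.x addresses are placeholders, not our real values, and yp.conf is a Linux convention rather than something the Mac reads.

```python
def render_yp_conf(servers, domain=None):
    """Render /etc/yp.conf lines that name NIS servers explicitly.

    With no domain, emits "ypserver HOST" lines (servers for the local
    domain); with a domain, emits "domain NISDOMAIN server HOST" lines.
    """
    if domain is None:
        lines = [f"ypserver {host}" for host in servers]
    else:
        lines = [f"domain {domain} server {host}" for host in servers]
    return "\n".join(lines) + "\n"


# Placeholder addresses from the documentation range, not real servers.
print(render_yp_conf(["192.0.2.10", "192.0.2.11"]), end="")
```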

I looked up the yp man pages on the Mac:

  • BSD System Manager's Manual - YP

Then I found this post describing where the configured servers are set:

  • Apple Support Communities: Network authentication using NIS fails

However, checking the ypserver settings showed that both server IPs were correctly set for NIS.
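Since the settings look right, the next question is whether those servers are reachable from the Mac at the moment things break. A minimal Python 3 sketch, assuming the binding file that ypbind reads (system.log names it as `/var/yp/binding/Domain.ypservers`) lists one server per line. It only attempts a TCP connection to port 111 (portmapper/rpcbind) on each server; ypbind itself talks to the servers over RPC, often via UDP, so a successful connect shows the host is routable, not that NIS is healthy.

```python
import socket
import sys


def read_ypservers(path):
    """Return the servers listed in a ypbind binding file.

    Assumption: one host name or IP per line; blank lines are skipped.
    """
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def probe(host, port=111, timeout=3.0):
    """Try a TCP connection to host:port (111 is portmapper/rpcbind).

    Returns None if the connection succeeds, otherwise the error text,
    e.g. "No route to host", "Connection refused" or "timed out".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return exc.strerror or str(exc)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 probe_yp.py /var/yp/binding/Domain.ypservers
    for server in read_ypservers(sys.argv[1]):
        error = probe(server)
        print(server, "reachable" if error is None else f"unreachable: {error}")
```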

Checking /var/log/system.log shows:

    Aug 28 00:30:08 mymac ypbind[22991]: direct: sendto: No route to host
    Aug 28 00:30:08 mymac ypbind[22991]: direct: sendto: No route to host
    Aug 28 00:30:08 mymac ypbind[22991]: Can't contact any servers listed in /var/yp/binding/Domain.ypservers.  Aborting
    Aug 28 00:30:08 mymac com.apple.launchd[1] (com.apple.nis.ypbind[22991]): Exited with code: 1
    Aug 28 00:30:08 mymac com.apple.launchd[1] (com.apple.nis.ypbind): Throttling respawn: Will start in 10 seconds
    Aug 28 00:30:08 mymac xpchelper[22990]: getpwuid_r() failed for UID: uuuu, ret: 0, errno: 0

This makes me suspect the nfs.conf settings, among other things. Others believe that something in lockd is responsible.
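The system.log excerpt above reads as a chain: ypbind cannot reach the servers (`No route to host`), gives up, launchd throttles its respawn, and a user lookup fails in the same second. A small Python 3 sketch that counts those events in a log, to see how often the chain recurs and whether it clusters around congested periods. The regular expression assumes the classic syslog layout shown above, and the category names are invented for illustration:

```python
import re
from collections import Counter

# "Mon DD HH:MM:SS host proc[pid]: msg", optionally with a launchd job
# label in parentheses after the pid, as in the excerpt above.
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\](?: \((?P<job>[^)]*)\))?: (?P<msg>.*)$"
)
BINDING_RE = re.compile(r"servers listed in (/\S+?\.ypservers)")


def summarize_ypbind(lines):
    """Count ypbind-related failure events; also collect binding files named."""
    counts = Counter()
    binding_files = set()
    for line in lines:
        m = SYSLOG_RE.match(line.rstrip("\n"))
        if not m:
            continue
        msg = m.group("msg")
        if m.group("proc") == "ypbind":
            if "No route to host" in msg:
                counts["no_route"] += 1
            found = BINDING_RE.search(msg)
            if found:
                counts["gave_up"] += 1
                binding_files.add(found.group(1))
        elif (m.group("job") or "").startswith("com.apple.nis.ypbind"):
            # launchd reporting on the ypbind job
            if msg.startswith("Exited"):
                counts["exits"] += 1
            elif msg.startswith("Throttling respawn"):
                counts["throttled"] += 1
        elif "getpwuid_r() failed" in msg:
            counts["lookup_failures"] += 1
    return counts, binding_files
```

For example, `summarize_ypbind(open("/var/log/system.log", errors="replace"))` would give the totals for the current log.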

Research

  • 10.6 Server: How to get NFS disk serving working properly

  • nfs.conf (and lockd)

  • launchd

  • mount_nfs - "An NFS server shouldn't loopback-mount its own exported file systems because it's fundamentally prone to deadlock."

  • rpc.lockd - "The current implementation serialises locks requests that could be shared."

Reports of Similar Issues

  • Xgrid agents not rejoining after server outage (very similar problem!)

  • NFS drops under heavy load

  • User can't unlock screensaver when using LDAP on OSX Lion

  • Versions working erratically, with xpchelper errors

  • Large number of related errors reported but not categorized on Apple boards


Did you bind the Mac to an OSX Server or to Active Directory? If the latter, check whether the domain ends in .local. If it does, there are some known problems with multicast interference on OSX. The process here may work for you: http://www.macwindows.com/TIP-Lion-dot-local-AD-disable-multicast.html
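A quick way to check the suffix is to read the domain from `dsconfigad -show`. The Python 3 sketch below assumes that command prints a line of the form `Active Directory Domain = corp.example` (verify the exact field name against the output on your machine). `.local` matters because it is the suffix reserved for multicast DNS (Bonjour), so OSX may try to resolve such names over multicast instead of asking the domain's DNS servers.

```python
import re

# Assumed `dsconfigad -show` line format: "Active Directory Domain   = corp.example"
DOMAIN_RE = re.compile(r"^\s*Active Directory Domain\s*=\s*(\S+)\s*$", re.MULTILINE)


def ad_domain(dsconfigad_output):
    """Return the bound AD domain from `dsconfigad -show` text, or None."""
    m = DOMAIN_RE.search(dsconfigad_output)
    return m.group(1) if m else None


def is_dot_local(domain):
    """True if the domain ends in .local, the suffix multicast DNS also uses."""
    return domain is not None and domain.lower().rstrip(".").endswith(".local")
```

On the Mac, the input text could come from `subprocess.run(["dsconfigad", "-show"], capture_output=True, text=True).stdout` (Python 3.7+).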

Some Macs just plain aren't happy on an AD network. I've had several iMacs with essentially identical specs; most were fine, but two kept losing connectivity with the domain controller and had constant Kerberos ticket issues. In that case, unbinding the Macs from the domain and rebinding them with Centrify Express resolved the issue. You can find the agent on their website: http://www.centrify.com/express/free-active-directory-tools-for-linux-mac.asp#agents