Reboot hangs 100% - possibly in mountall

UPDATE: It seems mountall is hanging inside the routine emit_event(), which it calls after / is remounted to emit an event to that effect. Inside emit_event, it calls ply_boot_client_flush(), then constructs the env array, calls upstart_emit_event(), then dbus_pending_call_block(). And there it hangs. So any ideas why dbus_pending_call_block would hang indefinitely? Broken plymouth? dbus? upstart? Any suggestions for fixes or further diagnostics?

Reboot of my Ubuntu 10.04 LTS, 64-bit AMD machine hangs 100% of the time. The drive access light is off, but the alt-sysreq keys do work. The hardware is a Lenovo W700ds laptop. Now, I apologize in advance, because I'm very limited in the information I have available about the system, and in what I can do with it (because it will not boot). I can boot from the 10.04 CD - using it like a rescue disk. I can fsck, mount, and read & write to my partitions - they are fine. I already tried reformatting my swap with mkswap. I have 4 ext4 partitions on my system: sda1 is /, sda2 is /usr, sda3 is /home, and a 4th that I use for data storage, sdb1 (it is the entire disk, and mounts at the mountpoint /hdb which I created). There is also sda4, which is swap. Right now I am writing this from a browser I have opened in the 'rescue session' from the 10.04 LTS install CD.

I would GREATLY appreciate suggestions/comments on what I could do to help diagnose what is hanging, why, and what I could do to fix it. I've done a web search already, but found nothing new along these lines (some 1-1.5 year old bug reports with similar symptoms, but their fixes did not work).

I installed 10.04 on a new disk around the first of July, then used aptitude to bring everything up to date. Since then I've been installing LOTS of packages (I'll attach the dpkg log below). With sda being 750GB (/ 20GB, /usr 80GB) I had lots of space to install packages that I 'might someday use'. I wonder if it's one of these packages that has screwed up my system? I installed kernel 2.6.32-32-generic and rebooted, but have installed many more packages since. I reboot this machine as rarely as possible - preferring to hibernate it while going from place to place. Lately though, I noticed some strange behavior associated with de-hibernation: when the system de-hibernated, it would bring up the gnome screen saver with a password prompt to unlock - well, it would not recognize my password! I had to alt-F1, log in as root, and kill the screen saver. Then all would be fine, or so it seemed. Also, upon de-hibernation I would frequently see blinking colorful garbage on the screen for a short while. It would go away, so I didn't try to find the cause. Another possibly relevant point is that I needed to use "nomodeset" in the installation of 10.04, and when bringing up the rescue shell from that same CD, if I use only nomodeset it will eventually hang with a flashing NumLock LED or Caps Lock LED (crash?), but if I also use "noapic nolapic acpi=off" then it comes up ok. I've tried these options with my system to see if they cure the boot hang problem - they do not.
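Since the suspicion is a package installed after the last good reboot, one way to narrow the list is to pull only the 'install' lines from the dpkg log that come after a given date. This is a rough sketch, not a tested tool: it assumes the usual dpkg.log line layout (date, time, action, package, old version, new version), and the function name is just something made up for illustration.

```python
from datetime import datetime


def packages_installed_since(log_text, since):
    """Return (timestamp, package, new_version) for each 'install' line
    in a dpkg log whose timestamp is at or after `since`.

    Assumed line layout: DATE TIME install PACKAGE OLD_VERSION NEW_VERSION
    """
    found = []
    for line in log_text.splitlines():
        parts = line.split()
        # Skip 'status', 'configure', 'upgrade', etc. and malformed lines.
        if len(parts) < 6 or parts[2] != "install":
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped log line
        if stamp >= since:
            found.append((stamp, parts[3], parts[5]))
    return found
```

From the rescue CD this would be fed the log from the mounted root partition (for example the text of var/log/dpkg.log under wherever sda1 is mounted), with `since` set to roughly the time of the last successful reboot.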

This is a machine I use for work as well as for nearly everything else, so getting it to boot again is a TOP priority. /home is intact, which is good. But I'm about at my wits end in trying to diagnose (much less fix) this cause of the hung boot.

I boot the system, and it starts running the mountall config script in /etc/init/mountall.conf. I see output from mountall running fsck - 4 lines that say: fsck from util-linux-ng 2.17.2 (that's one per ext4 partition). Then there are 4 more lines from fsck informing the user that the partitions were found to be "clean". And that is it - everything just stops. The drive activity LED goes off. I can use the alt-sysreq keys, but they have so far not proven useful. I saw a bug report where one user used alt-sysreq-i to kill processes and it dropped him into a shell. For me, it does say it has killed processes (udev and udev-bridge and plymouth, and it says it's respawning udev, etc), but I do not get any shell.

I have been trying to determine what exactly is hanging. To this end, I've tinkered with /etc/init/mountall.conf. I have added echo lines, and I have added the -v (verbose) option to the exec of mountall. No echo lines after the exec of mountall are shown, so this may mean mountall is hanging. Or, it may not be displaying the last of the output - in which case mountall may have exited and something else may be hanging. I note that alt-sysreq-i does not say mountall is killed. I've tried to narrow down what the system might be hanging on by commenting out sda3 (/home), swap and sdb1 (/hdb) from fstab, but it still hangs.
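For what it's worth, the fstab narrowing step can be done repeatably from the rescue environment with a few lines of Python instead of hand editing. This is only a sketch under the assumption that fstab entries are whitespace-separated with the mount point in the second field (swap usually shows up there as "none" or "swap"); the helper name is made up.

```python
def comment_out_mounts(fstab_text, mount_points):
    """Return fstab text with every entry whose mount point (2nd field)
    is in `mount_points` commented out. Other lines are left untouched."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        already_comment = line.lstrip().startswith("#")
        if len(fields) >= 2 and not already_comment and fields[1] in mount_points:
            out.append("#" + line)  # disable this entry
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Keeping a copy of the original fstab next to the edited one makes it easy to restore the entries once the culprit is found.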

There is a lot I can do myself, but I feel like I'm in over my head here. I would like to, for example, get the source code for mountall, add printed flags, recompile and stick it on my system - to narrow down A) if mountall is actually hanging, and B) what it is hanging on. BUT, I can not boot my machine to a shell from which to compile - and the rescue disk environment is only 2.6.32-28-generic #55 - so it would not match my system. I'd like to remove or reinstall packages, but again, I can not boot my machine and do this.

(my dpkg log file is several MBs, so I will attach it in a following dialog box)

Thanks, Greg


Solution 1:

Denwerko: I have done nothing to my machine that should have produced this result. It was pretty stable under Ubuntu 9.10 - never had anything like this happen. All of the tinkering with source, recompiling things - it's all been user-space code. I have not been tinkering at all with the OS. Nor have I installed any OS-space code outside of the standard channels (aptitude/synaptic package manager, deb packages obtained through those tools). Greg yesterday

However, I've obtained the source code to mountall 2.15.3 and got it to compile in the rescue environment, after 5 installs (libnih-dev, libnih-dbus-dev, libdbus-1-dev, libudev-dev, libplymouth-dev). I've added debugging prints in the code via nih_info() calls, and I've made the spawns that execute fsck blocking instead of non-blocking. I'm working on the theory that mountall is crashing somewhere (or nih, or dbus or plymouth...). I do not seem to get output up to the same place in the code each run, but it seems to stop sometime after the remount of /dev/sda1 to / - in the mounted() routine. Greg yesterday

I've also been doing dpkg -r of packages via chroot as you suggested, and that seems to work (except for one deinstall script that wanted to do something with /proc). I deinstalled wine, the 32-bit compatibility packages it needed (lib32nss, ia32lib, lib32v4l, etc) and several ibus packages that are not installed in the rescue environment (some ibus packages are, and I was careful not to remove those) - removed plasma-widget-kimpanel-backend-ibus, ibus-qt4, ibus-qt1. None of this affected the problem, so I've removed more packages I don't need now (wx widget & jdk packages, etc) - no effect.

UPDATE: It seems mountall is hanging inside the routine emit_event(), which it calls after / is remounted to emit an event to that effect. Inside emit_event, it calls ply_boot_client_flush(), then constructs the env array, calls upstart_emit_event(), then dbus_pending_call_block(). And there it hangs. So any ideas why dbus_pending_call_block would hang indefinitely? Broken plymouth? dbus? upstart? Any suggestions for fixes or further diagnostics?

SOLUTION So, it seems I had installed cloud-init and cloud-utils because I thought someday I might want to play with them. Well, it turns out cloud-init screws with the ureadahead configuration, and launches when the dbus event 'mounted /' happens, which caused my system to hang as soon as mountall sent out that dbus message, which happens after / gets remounted from ro to r/w. I deinstalled cloud-init and cloud-utils and all seems ok now. Except, I'm sleepy and have lost 24 hours of my life :\
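In hindsight, a quicker route to the culprit might have been to list which upstart jobs are started by the 'mounted' event, since the hang was right where that event is emitted. Below is a rough sketch of such a scan. It assumes job files live as *.conf in an init directory (the /etc/init of the broken system, reached through wherever its root is mounted from the rescue CD) and that each job's trigger is in a 'start on' stanza, possibly continued over several lines inside parentheses. The function names are made up, and the parsing is deliberately simplistic.

```python
import os
import re


def start_condition(conf_text):
    """Return the text of the first 'start on' stanza in an upstart job
    file, joining continuation lines while parentheses are unbalanced."""
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"start\s+on\b(.*)", stripped)
        if not match:
            continue
        cond = match.group(1)
        depth = cond.count("(") - cond.count(")")
        j = i + 1
        while depth > 0 and j < len(lines):
            nxt = lines[j].split("#", 1)[0].strip()
            cond += " " + nxt
            depth += nxt.count("(") - nxt.count(")")
            j += 1
        return cond.strip()
    return ""


def jobs_triggered_by(event, init_dir="/etc/init"):
    """Map job file name -> start condition for every *.conf job in
    `init_dir` whose 'start on' stanza mentions `event` as a whole word."""
    hits = {}
    pattern = re.compile(r"\b" + re.escape(event) + r"\b")
    for name in sorted(os.listdir(init_dir)):
        if not name.endswith(".conf"):
            continue
        with open(os.path.join(init_dir, name), errors="replace") as fh:
            cond = start_condition(fh.read())
        if pattern.search(cond):
            hits[name] = cond
    return hits
```

Calling jobs_triggered_by("mounted", init_dir) with init_dir pointing at the broken system's etc/init gives a short list of jobs to inspect or temporarily move aside, which is a lot less painful than removing packages one by one.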