"service mysqld stop" times out (then found out that "mysqld dead but subsys locked")
I installed mysql and server via yum on my 64bit CentOS 5 server. It starts up fine but when I try to stop it it stalls out and then I have to "Ctrl-C" it. Then I run a "service mysqld status" and it shows:
mysqld dead but subsys locked
I run ps aux and mysql is nowhere to be found. Starting mysqld again via "service mysqld start" works fine. Attempting to stop it creates the same problem.
I then realized that /var/lock/subsys/mysqld
still exist. When running mysqld I checked /var/run/mysqld/mysqld.pid
and it matched with the pid of the running service.
I tried reinstalling mysql and deleting all files and configs but to no avail.
What to do?
EDIT:
I added some echo statements in the /etc/init.d/mysqld file, particularly in the stop function:
stop(){
if [ ! -f "$mypidfile" ]; then
# not running; per LSB standards this is "ok"
action $"Stopping $prog: " /bin/true
return 0
fi
echo "beginning stop sequence"
MYSQLPID=`cat "$mypidfile"`
if [ -n "$MYSQLPID" ]; then
/bin/kill "$MYSQLPID" >/dev/null 2>&1
echo "killing pid $MYSQLPID"
ret=$?
if [ $ret -eq 0 ]; then
echo "return code $ret after kill attempt"
TIMEOUT="$STOPTIMEOUT"
echo "timeout is set to $STOPTIMEOUT"
while [ $TIMEOUT -gt 0 ]; do
/bin/kill -0 "$MYSQLPID" >/dev/null 2>&1 || break
sleep 1
let TIMEOUT=${TIMEOUT}-1
echo "timeout is now $TIMEOUT"
done
if [ $TIMEOUT -eq 0 ]; then
echo "Timeout error occurred trying to stop MySQL Daemon."
ret=1
action $"Stopping $prog: " /bin/false
else
echo "attempting to del lockfile: $lockfile"
rm -f $lockfile
rm -f "$socketfile"
action $"Stopping $prog: " /bin/true
fi
else
action $"Stopping $prog: " /bin/false
fi
else
# failed to read pidfile, probably insufficient permissions
action $"Stopping $prog: " /bin/false
ret=4
fi
return $ret
}
This is the result I get when I try to stop the service:
[root@server]# service mysqld stop
beginning stop sequence
killing pid 9145
return code 0 after kill attempt
timeout is set to 60
timeout is now 59
timeout is now 58
timeout is now 57
timeout is now 56
timeout is now 55
timeout is now 54
timeout is now 53
timeout is now 52
timeout is now 51
timeout is now 50
timeout is now 49
From looking at the code it seems to me that it will never break out of that while loop and will not be able to delete the lock file. Am I interpreting this wrong? I checked the same file on my other server and it uses the same code. I am dumbfounded.
EDIT: In the while loop part
/bin/kill -0 "$MYSQLPID" >/dev/null 2>&1 || break
For some reason it is not recognizing the return code. When service mysqld stop is called, the process has already been killed but not sure why it isn't allowing the loop to break out.
EDIT:
Further testing shows some weird behavior between calling /bin/kill
and just calling kill
, they apparently return different codes, why??????:
[root@server]# /bin/kill 25200
kill 25200: No such process
[user@server]# echo ${?}
0
[root@server]# kill 25200
-bash: kill: (25200) - No such process
[root@server]# echo ${?}
1
EDIT: I logged on as a non-root user and tried executing "kill" and "/bin/kill" with a surprising results:
[notroot@server ~]$ kill -0 23232
-bash: kill: (23232) - No such process
[notroot@server ~]$ echo $?
1
[notroot@server ~]$ /bin/kill -0 23232
kill 23232: No such process
(No info could be read for "-p": geteuid()=501 but you should be root.)
[notroot@server ~]$ echo $?
0
The "No info could be read" error does not show up in my other servers when executing kill and bin/kill as a nonroot user.
EDIT: Added logging described by quanta and also checked out the mysql log:
After a start and stop, mysql log shows this:
110918 00:11:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
110918 0:11:28 [Note] Plugin 'FEDERATED' is disabled.
110918 0:11:28 InnoDB: Initializing buffer pool, size = 16.0M
110918 0:11:28 InnoDB: Completed initialization of buffer pool
110918 0:11:29 InnoDB: Started; log sequence number 0 44233
110918 0:11:29 [Warning] 'user' entry 'root@server' ignored in --skip-name-resolve mode.
110918 0:11:29 [Warning] 'user' entry '@server' ignored in --skip-name-resolve mode.
110918 0:11:29 [Note] Event Scheduler: Loaded 0 events
110918 0:11:29 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.58-ius' socket: '/var/lib/mysql/mysql.sock' port: 3306 Distributed by The IUS Community Project
110918 0:11:34 [Note] /usr/libexec/mysqld: Normal shutdown
110918 0:11:34 [Note] Event Scheduler: Purging the queue. 0 events
110918 0:11:34 InnoDB: Starting shutdown...
110918 0:11:39 InnoDB: Shutdown completed; log sequence number 0 44233
110918 0:11:39 [Note] /usr/libexec/mysqld: Shutdown complete
110918 00:11:39 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
Then in the tmp/mysql.log:
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
kill 23080: No such process
I stopped the stop process midway so I don't have to wait for the timeout. Looks like the process was killed. The problem, I think, is still with the different return codes from "kill"
and "/bin/kill"
First things first: a very well-done, systematic and thorough debugging, good job.
On my RHEL 5.6 box I always get a return code of 1 if I try to kill a non-existing pid. I tried as both root and a non-privileged user, both with full path and a with just the command name. I also get only terse kill XXX: No such process
, with no elaborate error messages.
It may be a good idea to run rpm -Vv util-linux
and see if somebody didn't replace /bin/kill
with a new and improved version. Even if rpm verification says the file is pristine, I'd try renaming /bin/kill
and copying over a binary from a working machine. If the file replacement helps and you don't uncover a legitimate the source of the change, then regardless of output of rpm verification I'd assume the machine was compromised.