Zookeeper node is not able to load database on production server anymore
I am running a zookeeper ensemble of 3 on three different ubuntu 14.04 nodes. The setup used to work OK, but now I noticed the zk1 is not operational. It does also not restart:
:/home/www$ sudo /etc/init.d/zookeeper status
* zookeeper is not running
:/home/www$ ps -ef | grep zoo
zookeep+ 11465 1 0 16:15 ? 00:00:00 /usr/bin/java -cp /etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/netty.jar:/usr/share/java/slf4j-api.jar:/usr/share/java/slf4j-log4j12.jar:/usr/share/java/zookeeper.jar -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,ROLLINGFILE org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zokeeper/conf/zoo.cfg
merlin 11492 25021 0 16:15 pts/1 00:00:00 grep --color=auto zoo
:/home/www$ echo stat | nc zk1 2181
:/home/www$ echo stat | nc zk2 2181
Zookeeper version: 3.4.5--1, built on 06/10/2013 17:26 GMT
Clients:
/10.0.0.103:42841[1](queued=0,recved=33936,sent=33950)
/10.0.0.101:38370[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/14
Received: 37987
Sent: 38069
Connections: 2
Outstanding: 0
Zxid: 0x1600000983
Mode: follower
Node count: 202
The startup script is not restarting zookeper nor is it stoping it. Status claims that it is not running. However, zk2 and zk3 are running but their startup script claims also that zookeeper is not running but it is.
This is a production server and so far solr is operational but I woule like to fix this a.s.a.p. Any help is greatly appreciated.
Here is the logfile, running a crazy amount of permanent errors. Something seems to be wrong with the dabase. I had to remove the log entries of solr the other day which might be the cause of it.
Zookeeper logfile:
2016-01-15 16:30:02,618 - INFO [main:QuorumPeerConfig@101] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2016-01-15 16:30:02,620 - INFO [main:QuorumPeerConfig@334] - Defaulting to majority quorums
2016-01-15 16:30:02,622 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2016-01-15 16:30:02,622 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2016-01-15 16:30:02,623 - INFO [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2016-01-15 16:30:02,628 - INFO [main:QuorumPeerMain@127] - Starting quorum peer
2016-01-15 16:30:02,635 - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2016-01-15 16:30:02,644 - INFO [main:QuorumPeer@913] - tickTime set to 2000
2016-01-15 16:30:02,644 - INFO [main:QuorumPeer@933] - minSessionTimeout set to -1
2016-01-15 16:30:02,644 - INFO [main:QuorumPeer@944] - maxSessionTimeout set to -1
2016-01-15 16:30:02,644 - INFO [main:QuorumPeer@959] - initLimit set to 10
2016-01-15 16:30:02,663 - INFO [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.1400009ce6
2016-01-15 16:30:02,677 - ERROR [main:Util@239] - Last transaction was partial.
2016-01-15 16:30:02,677 - ERROR [main:QuorumPeer@453] - Unable to load database on disk
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
2016-01-15 16:30:02,678 - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:132)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
... 4 more
I removed now the snapshots and could restart zookeeper on this node, however it will not connect to the ensemble but show only one node up:
echo stat | nc zk1 2181
Zookeeper version: 3.4.5--1, built on 06/10/2013 17:26 GMT
Clients:
/10.0.0.101:57508[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x16000009a4
Mode: follower
Node count: 202
Last transaction was partial.
2016-01-15 16:30:02,677 - ERROR [main:QuorumPeer@453] - Unable to load database on disk
I had the same issue, it mainly occurs when the disk space is full or zookeeper fails to write into disk. One way to solve this issue is by clearing zookeeper log (/<base_path>/zookeeper/version-2/
).