Identify data loss due to logging failure on Exchange 2013 database(s)
Over the past few weeks our Exchange server has had problems with its backup jobs failing to complete, which has caused our normally ~empty log drives to fill up to the point where Exchange dismounted databases and logged various errors about replaying log files. Lamentably, nobody in the backup team did their job properly, so over the weekend we had a failure that dismounted ~40 databases because there were ~100GB of logs on drives that generally sit at ~3GB. Everyone working the weekend failed to look at the history of the issue and, rather than contact anyone else, enabled circular logging after everyone on the team had been instructed not to, remounted all the databases and called it a day.
We've not heard of any data loss from any users yet, but given that all databases encountered this, and that there are complaints about logging failures, replay failures, and unexpected dismounts, I'm concerned that there may be some.
Aside from firing the backup team, the weekend monitors and the admin that decided to enable circular logging after a significant amount of time between backups without saving the logs anywhere in case of a need to restore from backup and get whatever we could back, what is my best course of action to determine if we lost anything?
Are there particular events which may be buried in the 3,000,000 log entries that span the six-hour window during which this was going on? Is performing an integrity check recommended? Defrag, on or offline?
On the Exchange server, the following is typical of what was logged. I've stripped the event source and ID because everything seems to be generic and of little assistance in determining whether things actually went super south or just ruined my Monday:
At 'TIME' the Microsoft Exchange Information Store Database 'DATABASE' copy on this server experienced a serious error which caused it to terminate its functional activity. The error returned by the remount attempt was "There is only one copy of this mailbox database (DATABASE). Automatic recovery is not available.". Consult the event log on the server for other storage and "ExchangeStoreDb" events for more specific information about the failures.
Information Store - DATABASE (9564) DATABASE: An attempt to write to the file "F:\Logs\DATABASE\E0Etmp.log" at offset 1048576 (0x0000000000100000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.
Information Store - DATABASE (9564) DATABASE: Unable to create a new logfile because the database cannot write to the log drive. The drive may be read-only, out of disk space, misconfigured, or corrupted. Error -529.
Information Store - DATABASE (9564) DATABASE: The logfile sequence in "F:\Logs\DATABASE\" has been halted due to a fatal error. No further updates are possible for the databases that use this logfile sequence. Please correct the problem and restart or restore from backup.
Information Store - DATABASE (9564) DATABASE: Database recovery/restore failed with unexpected error -510.
The Microsoft Exchange Mailbox Replication service was unable to process jobs in a mailbox database. Database: DATABASE Error: MapiExceptionMdbOffline: Unable to open message store. (hr=0x80004005, ec=1142) Diagnostic context:
At 'TIME', the copy of database 'DATABASE' on this server encountered an error during the mount operation. For more information, consult the Event log on the server for "ExchangeStoreDb" or "MSExchangeRepl" events. The mount operation will be tried again automatically.
This is a standalone server, so the "only one copy" errors seem to be expected. There are also numerous client access errors logged during the time that this was happening, which I've omitted.
Solution 1:
I tend to think that you have minimal data loss, if any. That sounds pretty extreme, I realize, but the basis for my opinion is that new data stopped coming in when the disks filled. Even if you did lose data, what would be lost would almost certainly be data that was received by the server immediately prior to the disk-full condition.
At the time the disk filled the Extensible Storage Engine (ESE) would flush log data to each database's reserve transaction logs before dismounting the database.
Exchange dismounted the stores. Any mail coming in from the Internet would be queued by your secondary MX (or by the sender's server if you have none) and sent later, or NDR'd by the sender's server (in which case the sender would be aware of the failure). I suppose there's a chance that a sender would drop a message from its queue w/o NDR'ing it, but that's hardly your problem.
Outlook clients would be unable to connect to their Information Store databases, so no new email from internal clients was likely generated to be lost.
You mentioned transaction log replay failures. That does sound a bit disturbing, but without knowing the extent of those failures it's hard to say. Because of the nature of transaction log replays (that is, committing recently-written uncommitted data to the database) the chance of replay failures having an effect on older stored data is fairly low. If users aren't seeing problems with the newest data in their mailboxes they probably aren't going to later on.
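One way to get a feel for the extent of those failures is to tally the ESE error codes per database from an exported copy of the event messages. The following is only a minimal Python sketch, assuming the messages have been exported as plain text, one message per line, in the same "Information Store - DATABASE (PID) DATABASE: ..." shape quoted in the question. The function name, the regular expressions, and the export format are all assumptions for illustration, not anything Exchange provides.

```python
import re
from collections import Counter, defaultdict

# Assumed message prefix, modeled on the events quoted in the question:
# "Information Store - <database> (<pid>) <database>: ..."
PREFIX = re.compile(r"^Information Store - (?P<db>.+?) \(\d+\) ")

# Negative ESE-style codes such as "error -1808" or "Error -529".
# Positive system errors (e.g. "system error 112") are deliberately ignored.
ERROR = re.compile(r"\berror (-\d+)", re.IGNORECASE)


def tally_ese_errors(lines):
    """Return {database: Counter({error_code: count})} for matching lines."""
    tally = defaultdict(Counter)
    for line in lines:
        prefix = PREFIX.match(line)
        if not prefix:
            continue  # not an Information Store message in the assumed shape
        for code in ERROR.findall(line):
            tally[prefix.group("db")][int(code)] += 1
    return tally
```

Fed the exported lines (for example, `tally_ese_errors(open("events.txt", encoding="utf-8"))`, where `events.txt` is a hypothetical export), this gives a per-database count of each code. Databases showing only the three codes quoted in the question (-1808, -529, -510) fit the disk-full picture. Any database showing other codes is where I'd look more closely.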
There really isn't a database fragmentation concern arising from the disk-full condition. The write patterns to the database would not change because the transaction log volume filled. Online defragmentation will still happen as it does normally. Offline defragmentation is normally not needed or recommended by Microsoft anymore.
It is conceivable, if the databases were stored on the volumes that filled, that the EDB files could have filesystem fragmentation but, generally speaking, Microsoft does not recommend defragmenting volumes that hold Exchange databases. You could hit your .EDB files with contig.exe to analyze their fragmentation if you wanted to be sure.
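If you have a lot of databases to check, the Contig analysis could be scripted. This is a rough Python sketch, assuming Sysinternals contig.exe is on the PATH and using its `-a` (analyze only) switch. The helper names and any database path you pass in are hypothetical.

```python
import subprocess


def contig_analyze_command(edb_path, contig_exe="contig.exe"):
    """Build the argument list for an analyze-only Contig run."""
    # -a asks Contig to report fragmentation only; without it Contig
    # would try to defragment the file, which is not wanted here.
    return [contig_exe, "-a", edb_path]


def analyze_fragmentation(edb_path, contig_exe="contig.exe"):
    """Run Contig in analyze mode and return its text output."""
    result = subprocess.run(
        contig_analyze_command(edb_path, contig_exe),
        capture_output=True,
        text=True,
        check=True,  # raise if Contig exits with a non-zero status
    )
    return result.stdout
```

Calling `analyze_fragmentation` on each .EDB path and reading the reported fragment counts is enough for a sanity check. I wouldn't act on the numbers beyond that.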
ESE is really robust. I think you're probably okay.