Delete performance for LOB data in SQL Server

This question is related to this forum thread.

I'm running SQL Server 2008 Developer Edition on my workstation and an Enterprise Edition two-node virtual machine cluster that I'll refer to as the "alpha cluster".

The time it takes to delete rows with a varbinary(max) column is directly related to the length of the data in that column. That may sound intuitive at first, but after investigation, it clashes with my understanding of how SQL Server actually deletes rows in general and deals with this kind of data.

The problem stems from a delete timeout (> 30 seconds) issue we are seeing in our .NET web application, but I have simplified it for the sake of this discussion.

When a record is deleted, SQL Server marks it as a ghost to be cleaned up by a Ghost Cleanup Task at a later time after the transaction commits (see Paul Randal's blog). In a test deleting three rows with 16 KB, 4 MB, and 50 MB of data in a varbinary(max) column, respectively, I see this happening on the page with the in-row portion of the data, as well as in the transaction log.
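
In case it helps anyone reproduce this, a query along the following lines should show the ghosted records per allocation unit. It assumes the test table [T] defined later in this post, and 'DETAILED' mode is needed for ghost_record_count to be populated:

-- Show ghosted records per allocation unit for table T in the current database
SELECT [index_id], [alloc_unit_type_desc], [ghost_record_count]
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('T'), NULL, NULL, 'DETAILED')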

What seems odd to me is that X locks are placed on all of the LOB data pages during the delete, and that the pages are deallocated in the PFS. I can see this in the transaction log, as well as with sp_lock and in sys.dm_db_index_operational_stats (page_lock_count).

This creates an I/O bottleneck on my workstation and our alpha cluster if those pages are not already in the buffer cache. In fact, the page_io_latch_wait_in_ms from the same DMV is practically the entire duration of the delete, and the page_io_latch_wait_count corresponds with the number of locked pages. For the 50 MB file on my workstation, this translates to over 3 seconds when starting with an empty buffer cache (checkpoint / dbcc dropcleanbuffers), and I have no doubt it would be longer for heavy fragmentation and under load.
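
Roughly, a query along these lines against that DMV surfaces the counters in question, scoped to the test table [T] in the current database (the exact shape isn't important; any query exposing page_lock_count and the page I/O latch columns will do):

-- Lock and page I/O latch counters for table T, per index
SELECT [index_id], [page_lock_count], [page_io_latch_wait_count], [page_io_latch_wait_in_ms]
FROM sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('T'), NULL, NULL)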

I also tried to rule out the possibility that allocating space in the cache was what took the time. Instead of the checkpoint method, I read in 2 GB of data from other rows before executing the delete, which is more than is allotted to the SQL Server process. I'm not sure whether that is a valid test, since I don't know exactly how SQL Server shuffles data in and out of the cache; I assumed it would always evict old pages in favor of new ones.

Furthermore, the delete doesn't even modify the LOB pages, which I can see with sys.dm_os_buffer_descriptors: the pages are clean after the delete, and fewer than 20 pages are dirtied by each of the small, medium, and large deletes. I also compared DBCC PAGE output for a sampling of the looked-up pages before and after, and there were no changes (only the ALLOCATED bit was cleared in the PFS). The delete simply deallocates them.
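
If you want to check the clean/dirty breakdown yourself, a query along these lines against sys.dm_os_buffer_descriptors should do it (the join conditions handle both the in-row/row-overflow and the LOB allocation units of [T]):

-- Count clean vs. dirty buffer pool pages belonging to table T
SELECT bd.[is_modified], COUNT(*) AS [page_count]
FROM sys.dm_os_buffer_descriptors bd
JOIN sys.allocation_units au ON au.[allocation_unit_id] = bd.[allocation_unit_id]
JOIN sys.partitions p
    ON (au.[type] IN (1, 3) AND p.[hobt_id] = au.[container_id])   -- in-row / row-overflow
    OR (au.[type] = 2 AND p.[partition_id] = au.[container_id])    -- LOB data
WHERE bd.[database_id] = DB_ID()
  AND p.[object_id] = OBJECT_ID('T')
GROUP BY bd.[is_modified]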

To further prove that the page lookups/deallocations are causing the issue, I tried the same test using a filestream column instead of vanilla varbinary(max). The deletes were constant time, regardless of LOB size.

So, first my academic questions:

  1. Why does SQL Server need to look up all of the LOB data pages in order to X lock them? Is that just a detail of how locks are represented in memory (stored with the page somehow)? This makes the I/O impact depend strongly on the data size when the pages are not already cached.
  2. Why the X locks at all, just to deallocate them? Isn't it sufficient to lock just the index leaf with the in-row portion, since the deallocation does not need to modify the pages themselves? Is there some other way to get at the LOB data that the lock protects against?
  3. Why deallocate the pages up front at all, given that there is already a background task dedicated to this type of work?

And maybe more important, my practical question:

  • Is there any way to make deletes operate differently? My goal is constant time deletes regardless of size, similar to filestream, where any cleanup happens in the background after the fact. Is it a configuration thing? Am I storing things strangely?

Here is how to reproduce the described test (executed through an SSMS query window):

CREATE TABLE [T] (
    [ID] [uniqueidentifier] NOT NULL PRIMARY KEY,
    [Data] [varbinary](max) NULL
)

DECLARE @SmallID uniqueidentifier
DECLARE @MediumID uniqueidentifier
DECLARE @LargeID uniqueidentifier

SELECT @SmallID = NEWID(), @MediumID = NEWID(), @LargeID = NEWID()
-- May want to keep these IDs somewhere so you can use them in the deletes without var declaration

INSERT INTO [T] VALUES (@SmallID, CAST(REPLICATE(CAST('a' AS varchar(max)), 16 * 1024) AS varbinary(max)))
INSERT INTO [T] VALUES (@MediumID, CAST(REPLICATE(CAST('a' AS varchar(max)), 4 * 1024 * 1024) AS varbinary(max)))
INSERT INTO [T] VALUES (@LargeID, CAST(REPLICATE(CAST('a' AS varchar(max)), 50 * 1024 * 1024) AS varbinary(max)))

-- Do this before test
CHECKPOINT
DBCC DROPCLEANBUFFERS
BEGIN TRAN

-- Do one of these deletes to measure results or profile
DELETE FROM [T] WHERE ID = @SmallID
DELETE FROM [T] WHERE ID = @MediumID
DELETE FROM [T] WHERE ID = @LargeID

-- Do this after test
ROLLBACK
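
The filestream comparison below used an equivalent table. A sketch of what that looks like, assuming FILESTREAM is enabled on the instance and the database has a FILESTREAM filegroup ([T_FS] and [FSGroup] are placeholder names; the inserts use the same REPLICATE pattern as above):

-- Filestream variant of the test table; requires a ROWGUIDCOL uniqueidentifier column
CREATE TABLE [T_FS] (
    [ID] [uniqueidentifier] ROWGUIDCOL NOT NULL PRIMARY KEY,
    [Data] [varbinary](max) FILESTREAM NULL
) FILESTREAM_ON [FSGroup]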

Here are some results from profiling the deletes on my workstation:

| Column Type | Delete Size | Duration (ms) | Reads | Writes | CPU (ms) |
|-------------|-------------|---------------|-------|--------|----------|
| VarBinary   | 16 KB       | 40            | 13    | 2      | 0        |
| VarBinary   | 4 MB        | 952           | 2318  | 2      | 0        |
| VarBinary   | 50 MB       | 2976          | 28594 | 1      | 62       |
| FileStream  | 16 KB       | 1             | 12    | 1      | 0        |
| FileStream  | 4 MB        | 0             | 9     | 0      | 0        |
| FileStream  | 50 MB       | 1             | 9     | 0      | 0        |
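
(If you don't want to set up a Profiler trace, wrapping each delete in SET STATISTICS IO / TIME gives comparable read and CPU figures; this is just an alternative way to capture the numbers, not how the table above was produced.)

-- Alternative to a Profiler trace: capture reads and CPU per statement
SET STATISTICS IO ON
SET STATISTICS TIME ON

DELETE FROM [T] WHERE ID = @LargeID

SET STATISTICS TIME OFF
SET STATISTICS IO OFF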

We can't necessarily just use filestream instead because:

  1. Our data size distribution doesn't warrant it.
  2. In practice, we add data in many chunks, and filestream doesn't support partial updates. We would need to design around this.

Update 1

I tested a theory that the data itself is being written to the transaction log as part of the delete, and this does not seem to be the case. Am I testing for this incorrectly? See below.

SELECT MAX([Current LSN]) FROM fn_dblog(NULL, NULL)
--0000002f:000001d9:0001

BEGIN TRAN
DELETE FROM [T] WHERE ID = @ID

SELECT
    SUM(
        DATALENGTH([RowLog Contents 0]) +
        DATALENGTH([RowLog Contents 1]) +
        DATALENGTH([RowLog Contents 3]) +
        DATALENGTH([RowLog Contents 4])
    ) [RowLog Contents Total],
    SUM(
        DATALENGTH([Log Record])
    ) [Log Record Total]
FROM fn_dblog(NULL, NULL)
WHERE [Current LSN] > '0000002f:000001d9:0001'

For a file over 5 MB in size, this returned 1651 | 171860.

Further, I would expect the pages themselves to be dirty if data were written to the log. Only the deallocations seem to be logged, which matches what is dirty after the delete.

Update 2

I did get a response back from Paul Randal. He confirmed that SQL Server has to read all of the pages in order to traverse the LOB tree and find which pages to deallocate, and said that there is no faster way to look them up. This is a half answer to questions 1 and 2 (though it doesn't explain the need for locks on the out-of-row data, but that is small potatoes).

Question 3 is still open: Why deallocate the pages up front if there is already a background task to do cleanup for deletes?

And of course, the all-important question: Is there a way to directly mitigate (i.e. not work around) this size-dependent delete behavior? I would think this would be a more common issue, unless we're really the only ones storing and deleting 50 MB rows in SQL Server. Does everyone else out there work around this with some form of a garbage collection job?


Solution 1:

I can't say exactly why deleting a VARBINARY(MAX) value is so much less efficient than deleting a filestream value, but here is one idea you could consider if you're just trying to avoid timeouts in your web application when deleting these LOBs: store the VARBINARY(MAX) values in a separate table (let's call it tblLOB) that is referenced by the original table (let's call it tblParent).

When you delete a record, you can just delete the row from the parent table and then have an occasional garbage collection process go in and clean up the orphaned records in the LOB table. There may be additional hard drive activity during this garbage collection, but it will at least be separate from the front-end web application and can be performed during off-peak times.
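
A minimal sketch of that idea (table and column names are just illustrative):

-- LOB values live in their own table
CREATE TABLE [tblLOB] (
    [LOBID] [uniqueidentifier] NOT NULL PRIMARY KEY,
    [Data] [varbinary](max) NULL
)

-- Parent table references the LOB row; deleting a parent row never touches tblLOB
CREATE TABLE [tblParent] (
    [ID] [uniqueidentifier] NOT NULL PRIMARY KEY,
    [LOBID] [uniqueidentifier] NULL REFERENCES [tblLOB] ([LOBID])
)

-- Web application delete: independent of LOB size
-- DELETE FROM [tblParent] WHERE [ID] = @ID

-- Occasional garbage collection job, run off-peak: remove orphaned LOB rows
DELETE L
FROM [tblLOB] L
WHERE NOT EXISTS (SELECT 1 FROM [tblParent] P WHERE P.[LOBID] = L.[LOBID])

The garbage-collection delete still pays the same per-page cost described in the question; it just pays it outside the web request path.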